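This post presents the daily list of the latest papers retrieved from the arXiv website, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

597 papers were updated today, including:

Natural Language Processing: 63 papers
Information Retrieval: 14 papers
Computer Vision: 136 papers

Natural Language Processing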

1. 【2603.03249】Using Learning Progressions to Guide AI Feedback for Science Learning

Authors: Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University)

Categories: Computation and Language (cs.CL)

Keywords: Generative artificial intelligence, Generative artificial, offers scalable support, artificial intelligence, domain experts

Comments: 15 pages, 4 figures

Abstract:Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
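
The inter-rater statistics mentioned in this abstract (percent agreement and Cohen's kappa) can be illustrated with a minimal sketch. The helper names and the toy labels are assumptions made for illustration; they are not the study's code or data.

```python
import math
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two coders assigned the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two coders."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each coder's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both coders used a single identical label: kappa is not estimable.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary codes from two coders on four feedback items.
coder_1 = [1, 1, 0, 0]
coder_2 = [1, 0, 0, 0]
print(percent_agreement(coder_1, coder_2), cohen_kappa(coder_1, coder_2))
```

The `nan` branch shows one reason kappa may be reported only for "estimable" dimensions: when both coders give every item the same label, chance agreement is 1 and the ratio is undefined.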

2. 【2603.03242】Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

Link: https://arxiv.org/abs/2603.03242

Authors: Patrick Gerard, Svitlana Volkova

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: vary across social, domain-specific contexts, Language models deployed, online communities, preference

Comments: 27 Pages

Abstract:Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities -- particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics -- where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.


3. 【2603.03206】Understanding and Mitigating Dataset Corruption in LLM Steering

Link: https://arxiv.org/abs/2603.03206

Authors: Cullen Anderson, Narmeen Oozeer, Foad Namjoo, Remy Ogasawara, Amirali Abdullah, Jeff M. Phillips

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Contrastive steering, inference time, simple and effective, effective method, method to adjust

Abstract:Contrastive steering has been shown as a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
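
To make the mean-computation step concrete, here is a minimal sketch of a difference-of-means steering direction and of how a robust center can blunt a corrupted example. The coordinate-wise median is used only as a simple stand-in; the abstract does not name the robust mean estimator the authors use, and the function names and toy activations are assumptions.

```python
from statistics import mean, median


def steering_direction(pos, neg, center=mean):
    """Per-coordinate center of 'with trait' minus 'without trait' activations."""
    dim = len(pos[0])
    mu_pos = [center([v[i] for v in pos]) for i in range(dim)]
    mu_neg = [center([v[i] for v in neg]) for i in range(dim)]
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(activation, direction, alpha):
    """Shift an activation along the steering direction by strength alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 2-D activations; the last positive example is a corrupted outlier.
pos = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [100.0, 50.0]]
neg = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

print(steering_direction(pos, neg, center=mean))    # pulled toward the outlier
print(steering_direction(pos, neg, center=median))  # stays near the clean direction
```

In this toy case a single corrupted vector drags the plain mean far from the clean direction, while the median-based direction is unchanged.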

4. 【2603.03205】Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Link: https://arxiv.org/abs/2603.03205

Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

Categories: Computation and Language (cs.CL)

Keywords: execute long-horizon actions, language models operate, single misstep, entering credentials, irreversible harm

Comments: 24 pages, 5 figures

Abstract:Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.

5. 【2603.03203】No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

Link: https://arxiv.org/abs/2603.03203

Authors: Omer Sela

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: output Distribution, model sampled outputs, sampled outputs, measuring the peakedness, Distribution

Comments: 8 pages main text, 5 pages appendix, 9 figures, 7 tables. Code available at [this https URL](https://github.com/Sela-Omer/Contamination-Detection-Small-LM)

Abstract:CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at this https URL

6. 【2603.03202】Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Link: https://arxiv.org/abs/2603.03202

Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung

Categories: Computation and Language (cs.CL)

Keywords: large language models, IMO level, language models, significant bottleneck, large language

Comments: Under review in ICML 2026

Abstract:As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at this https URL.

7. 【2603.03198】ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments

Link: https://arxiv.org/abs/2603.03198

Authors: Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, Xiaogang Wang

Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demands robust generalization, unmanned aerial vehicles, intelligence demands robust, demands robust, unmanned aerial

Comments: Code: [this https URL](https://github.com/ACE-BRAIN-Team/ACE-Brain-0) Hugging Face: [this https URL](https://huggingface.co/ACE-Brain/ACE-Brain-0-8B)

Abstract:Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, existing embodied brain in training a unified model over diverse embodiments frequently triggers long-tail data, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.

8. 【2603.03194】BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Link: https://arxiv.org/abs/2603.03194

Authors: Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: domain-specialized problem solving, primarily assess narrow, overlooking critical real-world, critical real-world challenges, agents primarily assess

Comments: Benchmark: [this https URL](https://huggingface.co/datasets/AweAI-Team/BeyondSWE) . Repo: [this https URL](https://github.com/AweAI-Team/BeyondSWE) . Scaffold: [this https URL](https://github.com/AweAI-Team/AweAgent)

Abstract:Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

9. 【2603.03192】MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Link: https://arxiv.org/abs/2603.03192

Authors: Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Omni-modal large language, recently achieved strong, achieved strong performance, remain highly susceptible, dominant language priors

Comments: CVPR 2026. Project Page: [this https URL](https://mod-dpo.github.io/)

Abstract:Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.

10. 【2603.03180】Type-Aware Retrieval-Augmented Generation with Dependency Closure for Solver-Executable Industrial Optimization Modeling

Link: https://arxiv.org/abs/2603.03180

Authors: Y. Zhong, R. Huang, M. Wang, Z. Guo, YC. Li, M. Yu, Z. Jin

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: requires reliable translation, modeling requires reliable, Automated industrial optimization, Automated industrial, requires reliable

Abstract:Automated industrial optimization modeling requires reliable translation of natural-language requirements into solver-executable code. However, large language models often generate non-compilable models due to missing declarations, type inconsistencies, and incomplete dependency contexts. We propose a type-aware retrieval-augmented generation (RAG) method that enforces modeling entity types and minimal dependency closure to ensure executability. Unlike existing RAG approaches that index unstructured text, our method constructs a domain-specific typed knowledge base by parsing heterogeneous sources, such as academic papers and solver code, into typed units and encoding their mathematical dependencies in a knowledge graph. Given a natural-language instruction, it performs hybrid retrieval and computes a minimal dependency-closed context, the smallest set of typed symbols required for solver-executable code, via dependency propagation over the graph. We validate the method on two constraint-intensive industrial cases: demand response optimization in battery production and flexible job shop scheduling. In the first case, our method generates an executable model incorporating demand-response incentives and load-reduction constraints, achieving peak shaving while preserving profitability; conventional RAG baselines fail. In the second case, it consistently produces compilable models that reach known optimal solutions, demonstrating robust cross-domain generalization; baselines fail entirely. Ablation studies confirm that enforcing type-aware dependency closure is essential for avoiding structural hallucinations and ensuring executability, addressing a critical barrier to deploying large language models in complex engineering optimization tasks.
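
The idea of a dependency-closed context can be sketched as a graph traversal: start from the retrieved symbols and keep adding whatever they depend on until nothing new appears. This is only an illustration of the general technique; the graph, the symbol names, and the `dependency_closure` helper are hypothetical and are not taken from the paper.

```python
from collections import deque


def dependency_closure(seeds, deps):
    """Return every symbol reachable from `seeds` by following `deps`.

    `deps` maps a symbol to the symbols it requires (hypothetical schema).
    """
    closure = set(seeds)
    queue = deque(seeds)
    while queue:
        symbol = queue.popleft()
        for required in deps.get(symbol, ()):
            if required not in closure:
                closure.add(required)
                queue.append(required)
    return closure


# Hypothetical dependency graph for a small optimization model.
deps = {
    "capacity_constraint": ["load", "max_power"],
    "load": ["time_index"],
    "objective": ["price", "load"],
}

# Symbols retrieved for an instruction, expanded to a dependency-closed set.
print(sorted(dependency_closure(["capacity_constraint"], deps)))
```

Starting from a retrieved constraint, the closure pulls in the variables, parameters, and index sets it needs, which is the kind of missing declaration the abstract says makes generated models fail to compile.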

11. 【2603.03142】APRES: An Agentic Paper Revision and Evaluation System

Link: https://arxiv.org/abs/2603.03142

Authors: Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Abhishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha, Yoram Bachrach

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: realize their full, Large Language Models, full potential, APRES, Language Models

Abstract:Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method, APRES, powered by Large Language Models (LLMs) to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts, and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean averaged error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.

12. 【2603.03134】UniSkill: A Dataset for Matching University Curricula to Professional Competencies

Link: https://arxiv.org/abs/2603.03134

Authors: Nurlan Musazade, Joszef Mezei, Mike Zhang

Categories: Computation and Language (cs.CL)

Keywords: Organization Analyst ESCO, studied from recruiter, education perspectives, Analyst ESCO occupation, Skill extraction

Comments: LREC 2026

Abstract:Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies in the instructed skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets of skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and university course pairs and publishing corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves 87% F1-score, showing that course and skill matching is a feasible task.

13. 【2603.03111】Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Link: https://arxiv.org/abs/2603.03111

Authors: Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan

Categories: Computation and Language (cs.CL)

Keywords: Deployed multi-turn LLM, LLM systems routinely, routinely switch models, switch models mid-interaction, models mid-interaction due

Abstract:Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
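
One simple way to read the "prefix influence plus suffix susceptibility" decomposition is as an additive fit to a drift matrix, where entry (i, j) is the score change when prefix model i hands off to suffix model j. The sketch below fits row and column effects by averaging and reports the share of variance they explain. The abstract does not describe the fitting procedure, so the estimator, the function name, and the toy matrix are all assumptions.

```python
def additive_fit(drift):
    """Fit drift[i][j] ~ grand + prefix[i] + suffix[j]; return effects and R^2."""
    n_rows, n_cols = len(drift), len(drift[0])
    grand = sum(sum(row) for row in drift) / (n_rows * n_cols)
    # Row effect: how much prefix model i shifts outcomes on average.
    prefix = [sum(row) / n_cols - grand for row in drift]
    # Column effect: how susceptible suffix model j is on average.
    suffix = [
        sum(drift[i][j] for i in range(n_rows)) / n_rows - grand
        for j in range(n_cols)
    ]
    ss_tot = ss_res = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            fitted = grand + prefix[i] + suffix[j]
            ss_tot += (drift[i][j] - grand) ** 2
            ss_res += (drift[i][j] - fitted) ** 2
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return prefix, suffix, r2


# Hypothetical drift matrix (percentage points); rows are prefix models.
drift = [[0.0, 1.0], [2.0, 3.0]]
prefix, suffix, r2 = additive_fit(drift)
print(prefix, suffix, r2)
```

This toy matrix is exactly additive, so the fit explains all of its variance; real handoff data would leave a residual for pair-specific interactions.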

14. 【2603.03095】Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection

Link: https://arxiv.org/abs/2603.03095

Authors: Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: jointly delimiting argumentative, delimiting argumentative spans, requires jointly delimiting, Argumentative component detection, delimiting argumentative

Comments: Under Review (COLM 2026)

Abstract:Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.

15. 【2603.03081】TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Link: https://arxiv.org/abs/2603.03081

Authors: Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, elicit unsafe responses, attackers craft prompts, bypass safety alignment, Large language

Abstract:Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

16. 【2603.03072】TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

Link: https://arxiv.org/abs/2603.03072

Authors: Christian Greisinger, Steffen Eger

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, Large language, diverse workflows, assist scientists, scientists across diverse

Abstract:Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.

17. 【2603.03056】Incremental Graph Construction Enables Robust Spectral Clustering of Texts

Link: https://arxiv.org/abs/2603.03056

Authors: Marko Pranjić, Boshko Koloski, Nada Lavrač, Senja Pollak, Marko Robnik-Šikonja

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Neighborhood graphs, spectral clustering, making spectral clustering, Massive Text Embedding, fragile step

Comments: MP and BK contributed equally

Abstract:Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method outperforms in the low-$k$ regime where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.
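
The construction described in this abstract is easy to sketch: insert points one at a time and link each new point to its $k$ nearest already-inserted points. Every node after the first then has at least one edge to an earlier node, which is the inductive reason the graph stays connected. The function names, Euclidean distance, and toy points below are assumptions for illustration, not the authors' implementation.

```python
import math


def incremental_knn_graph(points, k):
    """Link each new point to its k nearest previously inserted points."""
    edges = set()
    for i in range(1, len(points)):
        earlier = sorted(range(i), key=lambda j: math.dist(points[i], points[j]))
        for j in earlier[:k]:
            edges.add((j, i))  # undirected edge, stored as (older, newer)
    return edges


def is_connected(n, edges):
    """Check connectivity of an undirected graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n


# Two well-separated toy clusters; a standard 1-NN graph would split them.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
edges = incremental_knn_graph(points, k=1)
print(sorted(edges), is_connected(len(points), edges))
```

With these points, the first node inserted from the second cluster has no choice but to link back to the first cluster, so the two groups are bridged even at $k = 1$.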

18. 【2603.03054】PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Link: https://arxiv.org/abs/2603.03054

Authors: Sudip Bhujel

Categories: Computation and Language (cs.CL)

Keywords: Large language models, patient-facing medical assistance, requires supervision derived, clinical decision support, Large language

Abstract:Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differential Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at this https URL.
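
For readers unfamiliar with DP-SGD, the core update can be sketched in a few lines of Python. This is the generic textbook form (per-example gradient clipping followed by Gaussian noise), not PrivMedChat's code; the function name `dp_sgd_step`, the list-based parameters, and all hyperparameter values below are assumptions chosen for illustration, and no privacy accounting is performed.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    clipped_sum = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    batch_size = len(per_example_grads)
    # Gaussian noise with standard deviation noise_multiplier * clip_norm
    # is added to the summed gradient before averaging.
    noisy_mean = [
        (s + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch_size
        for s in clipped_sum
    ]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Hypothetical two-parameter model and two per-example gradients.
rng = random.Random(0)
new_params = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    clip_norm=1.0,
    noise_multiplier=1.0,
    rng=rng,
)
print(new_params)
```

Clipping bounds how much any single dialogue can move the parameters, and the noise scale is tied to that bound; translating a noise multiplier into a concrete privacy budget such as the abstract's $\varepsilon=7$ requires a separate privacy accountant that this sketch omits.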

19. 【2603.03047】TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Link: https://arxiv.org/abs/2603.03047

Authors: Zixin Xiong, Ziteng Wang, Haotian Fan, Xinjie Zhang, Wenxuan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, practical deployment raises, providing accessible mental, trustworthiness concerns due

Comments:

Click to view abstract

Abstract:While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.

20. 【2603.03001】MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling

Link: https://arxiv.org/abs/2603.03001

Authors: Jinwoong Kim, Sangjin Park

Categories: Computation and Language (cs.CL)

Keywords: Bidirectional Encoder Representations, Bidirectional Encoder, Encoder Representations, scale quadratically, Representations from Transformers

Comments: 8 pages

Click to view abstract

Abstract:Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on the CoLA and sentence pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical long-context efficient encoder.
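
The two padding-related ideas named in this abstract can be illustrated with a toy Python sketch. The abstract does not give the exact formulation, so everything here is an assumption: `masked_scan` uses a simple scalar recurrence as a stand-in for a state space layer, and `masked_attention_pool` uses a plain softmax over per-token scores. The only point is that padded positions neither update the state nor receive pooling weight.

```python
import math


def masked_scan(inputs, mask, decay=0.5):
    """Toy linear recurrence that freezes its state at padded positions."""
    state = 0.0
    states = []
    for x, valid in zip(inputs, mask):
        if valid:
            state = decay * state + x  # state update only for real tokens
        states.append(state)  # padded positions carry the last valid state
    return states


def masked_attention_pool(hidden, scores, mask):
    """Softmax-weighted average of token vectors, restricted to valid tokens."""
    valid_scores = [s for s, valid in zip(scores, mask) if valid]
    if not valid_scores:
        raise ValueError("at least one valid token is required")
    top = max(valid_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) if valid else 0.0 for s, valid in zip(scores, mask)]
    total = sum(weights)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) / total for d in range(dim)]


# Hypothetical batch row: two real tokens followed by padding.
print(masked_scan([1.0, 1.0, 9.0, 9.0], [1, 1, 0, 0]))
print(masked_attention_pool([[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]],
                            [0.0, 0.0, 50.0], [1, 1, 0]))
```

In both functions the padding values (9.0 and the 100.0 vector) have no influence on the result, which is the property the abstract describes as avoiding padding-induced state contamination.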

21. 【2603.02983】Contextualized Privacy Defense for LLM Agents

Link: https://arxiv.org/abs/2603.02983

Authors: Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM agents increasingly, users' personal information, defenses remain limited, agents increasingly act, LLM agents

Comments: 25 pages

Click to view abstract

Abstract:LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution. We propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step-specific, context-aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience-driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy-helpfulness trade-offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.

22. 【2603.02945】ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Link: https://arxiv.org/abs/2603.02945

Authors: Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin

Categories: Computation and Language (cs.CL)

Keywords: combine multiple task-specific, multiple task-specific expert, task-specific expert models, aims to combine, combine multiple

Comments: Accepted to CVPR 2026 (Main Track)

Click to view abstract

Abstract:Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.

23. 【2603.02909】Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Link: https://arxiv.org/abs/2603.02909

Authors: Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao, Ru Li, Hongye Tan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Document-level event argument, existing methods employ, methods employ LLMs, generate synthetic data, event argument extraction

Comments: Accepted by AAAI 2026

Click to view abstract

Abstract:Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on Event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of "Propose-Evaluate-Revise." Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.

24. 【2603.02876】Eval4Sim: An Evaluation Framework for Persona Simulation

Link: https://arxiv.org/abs/2603.02876

Authors: Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Language Model, behavioural analysis, social reasoning

Comments:

Click to view abstract

Abstract:Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.

25. 【2603.02873】LaTeX Compilation: Challenges in the Era of LLMs

Link: https://arxiv.org/abs/2603.02873

Authors: Tianyou Liu, Ziqiang Li, Yansong Li, Xurui Liu

Categories: Computation and Language (cs.CL)

Keywords: large language models, assist scientific writing, increasingly assist scientific, significant token cost, language models

Comments: 25 pages, 12 figures

Click to view abstract

Abstract:As large language models (LLMs) increasingly assist scientific writing, the limitations and significant token cost of TeX become increasingly visible. This paper analyzes TeX's fundamental defects in compilation and user experience design to illustrate its limitations in compilation efficiency, generated semantics, error localization, and tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in these respects through its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits on compilation/rendering time and performance in LLM tasks. Moreover, we show that due to Mogan's lower information entropy, it is more efficient to fine-tune LLMs on .tmu (the document format of Mogan) than on TeX. We therefore call for larger experiments on LLM training using the .tmu format.

26. 【2603.02865】Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.02865

Authors: Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, demonstrate strong performance, Large vision-language, diagram understanding benchmarks, arrows and lines

Comments:

Click to view abstract

Abstract:Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.

27. 【2603.02860】The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models

Link: https://arxiv.org/abs/2603.02860

Authors: Fermín Moscoso del Prado Martín, Suchir Salhan

Categories: Computation and Language (cs.CL)

Keywords: microscopic levels, macroscopic and microscopic, symmetric Dirichlet distribution, Maximum Entropy model, Dirichlet distribution

Comments:

Click to view abstract

Abstract:We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
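
To make the macroscopic claim concrete, the following Python sketch simulates the expected rank-frequency curve of a symmetric Dirichlet distribution. It is only an illustration of what "order statistics of a symmetric Dirichlet" means: the inventory size of 30, the concentration of 1.0, and the function names are assumptions, not values or code from the paper. Sampling relies on the standard fact that normalizing independent Gamma(alpha, 1) draws yields a symmetric Dirichlet sample.

```python
import random


def sample_rank_frequencies(inventory_size, concentration, rng):
    """Draw one symmetric Dirichlet sample and sort it from most to least frequent."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(inventory_size)]
    total = sum(draws)
    return sorted((d / total for d in draws), reverse=True)


def mean_rank_frequencies(inventory_size, concentration, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expected frequency at each rank."""
    rng = random.Random(seed)
    sums = [0.0] * inventory_size
    for _ in range(n_samples):
        sample = sample_rank_frequencies(inventory_size, concentration, rng)
        for rank, p in enumerate(sample):
            sums[rank] += p
    return [s / n_samples for s in sums]


# Hypothetical inventory of 30 phonemes with concentration 1.0.
curve = mean_rank_frequencies(inventory_size=30, concentration=1.0)
print([round(p, 3) for p in curve[:5]])
```

Comparing such a simulated curve with a language's observed phoneme frequencies, sorted in the same way, is one simple way to picture the fit the abstract reports; the paper's actual estimation procedure is not described here.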

28. 【2603.02842】A Browser-based Open Source Assistant for Multimodal Content Verification

Link: https://arxiv.org/abs/2603.02842

Authors: Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini, Denis Teyssou, Kalina Bontcheva

Categories: Computation and Language (cs.CL)

Keywords: rapidly verify digital, digital media information, verify digital media, false content produced, VERIFICATION ASSISTANT

Comments:

Click to view abstract

Abstract:Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.

29. 【2603.02830】Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs

Link: https://arxiv.org/abs/2603.02830

Authors: Prarthana Bhattacharyya, Joshua Mitton, Ralph Abboud, Simon Woodhead

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: enables effective interventions, educational learning platforms, effective interventions, models, learning platforms

Comments: 7 pages, 6 figures. Prarthana Bhattacharyya and Joshua Mitton contributed equally to this work

Click to view abstract

Abstract:Predicting future student responses to questions is particularly valuable for educational learning platforms where it enables effective interventions. One of the key approaches to do this has been through the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students' future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed source LLMs should not be used as a universal solution for all tasks.

30. 【2603.02798】Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification

Link: https://arxiv.org/abs/2603.02798

Authors: Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van de Schaar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: facilitate trustworthy deployment, develop reliable verification, high-stakes decision-making, trustworthy deployment, critical to develop

Comments:

Click to view abstract

Abstract:As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish GLEAN, an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates the step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases via expanding guideline coverage and performing differential checks. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both discrimination and calibration. In addition, the expert study with clinicians recognizes GLEAN's utility in practice.

31. 【2603.02789】OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Link: https://arxiv.org/abs/2603.02789

Authors: Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, natural language processing, Large Language Models, Multimodal Large, Language Models

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline--while simpler--can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve performance comparable to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.

32. 【2603.02775】From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Link: https://arxiv.org/abs/2603.02775

Authors: Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu, Yunqiao Yang, Yuxuan Hu, Linda Wei, Mingjie Zhan, Hongsheng Li

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, narrow pedagogical scenarios, multi-turn teaching effectiveness, Mathematical Pedagogical Benchmark

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

33. 【2603.02760】Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

Link: https://arxiv.org/abs/2603.02760

Authors: Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion large language, recently attracted significant, attracted significant attention, Diffusion large, large language models

Comments:

Click to view abstract

Abstract:Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.

34. 【2603.02709】Sensory-Aware Sequential Recommendation via Review-Distilled Representations

Link: https://arxiv.org/abs/2603.02709

Authors: Yeo Chan Yoon

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Attribute-based Sensory Enhanced, Sensory Enhanced Generative, Enhanced Generative Recommendation, framework for sensory-aware, Enhanced Generative

Comments:

Click to view abstract

Abstract:We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, \textsc{ASEGR} (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute--value pairs, such as \textit{color: matte black} and \textit{scent: vanilla}, from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.

35. 【2603.02701】Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Link: https://arxiv.org/abs/2603.02701

Authors: Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Language Model, Large Language, based Multi-Agent Systems, Multi-Agent Systems

Comments:

Click to view abstract

Abstract:Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
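
The group-relative normalization mentioned in this abstract can be sketched in Python as follows. The first function is the usual group-relative advantage (reward minus group mean, divided by group standard deviation). The second, `edge_advantages`, is a hypothetical way to attribute credit to edges by averaging the advantages of the sampled graphs that contain each edge; the abstract does not specify how Graph-GRPO does this, so treat it as an assumption, along with the toy graphs and rewards.

```python
import statistics
from collections import defaultdict


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (same query)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def edge_advantages(graphs, rewards):
    """Hypothetical edge credit: mean advantage of the graphs containing each edge."""
    advantages = group_relative_advantages(rewards)
    per_edge = defaultdict(list)
    for edges, adv in zip(graphs, advantages):
        for edge in edges:
            per_edge[edge].append(adv)
    return {edge: statistics.fmean(vals) for edge, vals in per_edge.items()}


# Hypothetical group of four sampled communication graphs for one query,
# each given as a set of directed agent-to-agent edges, with binary rewards.
graphs = [
    {("planner", "coder")},
    {("planner", "coder"), ("coder", "reviewer")},
    {("coder", "reviewer")},
    set(),
]
rewards = [1.0, 1.0, 0.0, 0.0]
print(edge_advantages(graphs, rewards))
```

When every graph in a group receives the same reward, as with an easy query that all topologies solve, the advantages are all zero, which matches the abstract's point that normalizing within the group removes uninformative signal caused by task difficulty.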

36. 【2603.02684】HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse

Link: https://arxiv.org/abs/2603.02684

Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: Subtle and indirect, hate speech remains, indirect hate speech, online safety research, remains an underexplored

Comments: Accepted at LREC 2026

Click to view abstract

Abstract:Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.

37. 【2603.02676】ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Link: https://arxiv.org/abs/2603.02676

Authors: Wicaksono Leksono Muhamad,Joanito Agili Lopo,Tack Hwa Wong,Muhammad Ravi Shulthan Habibi,Samuel Cahyawijaya

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, language models suffer, multi-lingual contexts, language models

Notes:

Click to view abstract

Abstract:Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.

38. 【2603.02663】Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Link: https://arxiv.org/abs/2603.02663

Authors: Shunki Uebayashi,Kento Masui,Kyohei Atarashi,Han Bao,Hisashi Kashima,Naoto Inoue,Mayu Otani,Koh Takeuchi

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, general architectures capable, Language Models

Notes: 24 pages, 20 figures, accepted to ICLR 2026

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

39. 【2603.02655】Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Link: https://arxiv.org/abs/2603.02655

Authors: Anum Afzal,Yuki Saito,Hiroya Takamura,Katsuhito Sudoh,Shinnosuke Takamichi,Graham Neubig,Florian Matthes,Tatsuya Ishigaki

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: textual descriptions, descriptions of ongoing, ongoing events, commentary generation, Real-time video commentary

Notes: Accepted at LREC 2026

Click to view abstract

Abstract:Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
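The dynamic interval idea in the abstract (schedule the next prediction according to the estimated duration of the previous utterance) can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's method: the word-count duration heuristic, the `words_per_second` and `min_gap` parameters, and the `generate` callback standing in for a prompted MLLM call are all hypothetical.

```python
def estimate_duration(utterance, words_per_second=2.5):
    """Estimate speaking time in seconds from the word count (hypothetical heuristic)."""
    return len(utterance.split()) / words_per_second


def schedule_commentary(generate, video_length, min_gap=1.0):
    """Return (timestamp, utterance) pairs for a video of video_length seconds.

    generate(t) stands in for a prompted MLLM call; an empty string means
    the model chose to stay silent at time t.
    """
    t, timeline = 0.0, []
    while t < video_length:
        utterance = generate(t)
        if utterance:
            timeline.append((t, utterance))
        # Wait until the previous utterance would have finished, but at least min_gap.
        t += max(min_gap, estimate_duration(utterance))
    return timeline
```

A fixed-interval variant would replace the last step with a constant increment, which is the contrast the abstract draws between the two strategies.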

40. 【2603.02640】Credibility Governance: A Social Mechanism for Collective Self-Correction under Weak Truth Signals

Link: https://arxiv.org/abs/2603.02640

Authors: Wanying He,Yanxi Lin,Ziheng Zhou,Xue Feng,Min Peng,Qianqian Xie,Zilong Zheng,Yipeng Kang

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Keywords: Online platforms increasingly, platforms increasingly rely, allocate real-world attention, Online platforms, attention and resources

Notes:

Click to view abstract

Abstract:Online platforms increasingly rely on opinion aggregation to allocate real-world attention and resources, yet common signals such as engagement votes or capital-weighted commitments are easy to amplify and often track visibility rather than reliability. This makes collective judgments brittle under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence. CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence via credibility-weighted endorsements, and updates agent credibility based on the long-run performance of the opinions they support, rewarding early and persistent alignment with emerging evidence while filtering short-lived noise. We evaluate CG in POLIS, a socio-physical simulation environment that models coupled belief dynamics and downstream feedback under uncertainty. Across settings with initial majority misalignment, observation noise and contamination, and misinformation shocks, CG outperforms vote-based, stake-weighted, and no-governance baselines, yielding faster recovery to the true state, reduced lock-in and path dependence, and improved robustness under adversarial pressure. Our implementation and experimental scripts are publicly available at this https URL.

41. 【2603.02637】StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2603.02637

Authors: Shiyang Li,Zijian Zhang,Winson Chen,Yuebo Luo,Mingyi Hong,Caiwen Ding

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Programming Languages (cs.PL)

Keywords: Modern machine learning, workloads increasingly rely, remains challenging due, Modern machine, GPU kernel efficiency

Notes:

Click to view abstract

Abstract:Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness check and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent Coder's reward hacking (e.g., just copy PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.

42. 【2603.02631】Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Link: https://arxiv.org/abs/2603.02631

Authors: Shubhangi Upasani,Ravi Shanker Raju,Bo Li,Mengmeing Ji,John Long,Chen Wu,Urmish Thakker,Guangtao Wang

Categories: Computation and Language (cs.CL)

Keywords: multi-call loops incur, large language model, agentic large language, loops incur substantial, prompt compression

Notes:

Click to view abstract

Abstract:Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90~100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.

43. 【2603.02615】Think, But Don't Overthink: Reproducing Recursive Language Models

Link: https://arxiv.org/abs/2603.02615

Authors: Daren Wang

Categories: Computation and Language (cs.CL)

Keywords: Recursive Language Models, Recursive Language, Large Language Models, Language Models, recently proposed

Notes:

Click to view abstract

Abstract:This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: Deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: this https URL

44. 【2603.02597】GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Link: https://arxiv.org/abs/2603.02597

Authors: Venu Gopal Kadamba,Kanishkha Jaisankar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Keywords: million-token context windows, GPUs sit unused, powerful GPUs sit, large language models, language models move

Notes:

Click to view abstract

Abstract:As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
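As a CPU-side point of reference for the sequential merge loop that a GPU tokenizer has to parallelize, a minimal sketch of byte-level BPE encoding might look like the following. It is not the authors' implementation: the merge table in the usage note is a toy, the function name `bpe_encode` is hypothetical, and GPT-2's regex pre-tokenization and byte-to-unicode mapping are deliberately omitted.

```python
def bpe_encode(text, merges):
    """Byte-level BPE: repeatedly apply the highest-priority merge present.

    merges maps a pair of byte strings to its rank (lower rank merges first).
    """
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        pairs = set(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no adjacent pair has a merge rule left
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

With the toy table `{(b"l", b"o"): 0, (b"lo", b"w"): 1}`, the input `"low lower"` is first merged into `lo` units and then into `low` units. Each pass over the sequence depends on the previous one, which is the step-by-step behavior the abstract identifies as the CPU bottleneck.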

45. 【2603.02588】ExpGuard: LLM Content Moderation in Specialized Domains

Link: https://arxiv.org/abs/2603.02588

Authors: Minseok Choi,Dongjin Kim,Seungbin Yang,Subin Kim,Youngjun Kwak,Juyoung Oh,Jaegul Choo,Jungmin Son

Categories: Computation and Language (cs.CL)

Keywords: establishing robust safety, large language models, robust safety guardrails, safety policies, real-world applications

Notes: ICLR 2026

Click to view abstract

Abstract:With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

46. 【2603.02578】How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Link: https://arxiv.org/abs/2603.02578

Authors: Ziwen Xu,Kewei Xu,Haoming Xu,Haiwen Hong,Longtao Huang,Hui Xue,Ningyu Zhang,Yongliang Shen,Guozhou Zheng,Huajun Chen,Shumin Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Large Language Models, pose significant risks, Large Language, Language Models, socially sensitive domains

Notes: Work in progress

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.

47. 【2603.02565】FlashEvaluator: Expanding Search Space with Parallel Evaluation

Link: https://arxiv.org/abs/2603.02565

Authors: Chao Feng,Yuanhao Pu,Chenghao Zhang,Shanqi Liu,Shuchang Liu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Natural Language Processing, Language Processing, Natural Language, generator and selecting, selecting the top-ranked

Notes: 23 pages, 2 figures

Click to view abstract

Abstract:The Generator-Evaluator (G-E) framework, i.e., evaluating K sequences from a generator and selecting the top-ranked one according to evaluator scores, is a foundational paradigm in tasks such as Recommender Systems (RecSys) and Natural Language Processing (NLP). Traditional evaluators process sequences independently, suffering from two major limitations: (1) lack of explicit cross-sequence comparison, leading to suboptimal accuracy; (2) poor parallelization with linear complexity of O(K), resulting in inefficient resource utilization and negative impact on both throughput and latency. To address these challenges, we propose FlashEvaluator, which enables cross-sequence token information sharing and processes all sequences in a single forward pass. This yields sublinear computational complexity that improves the system's efficiency and supports direct inter-sequence comparisons that improve selection accuracy. The paper also provides theoretical proofs and extensive experiments on recommendation and NLP tasks, demonstrating clear advantages over conventional methods. Notably, FlashEvaluator has been deployed in online recommender system of Kuaishou, delivering substantial and sustained revenue gains in practice.

48. 【2603.02556】Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Link: https://arxiv.org/abs/2603.02556

Authors: Zhiyu Pan,Yizheng Wu,Jiashen Hua,Junyi Feng,Shaotian Yan,Bing Deng,Zhiguo Cao,Jieping Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reasoning, large language models, visual, visual reasoning, reasoning paths

Notes: 19 pages, 9 figures, accepted to ICLR 2026 (oral)

Click to view abstract

Abstract:Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: this https URL.

49. 【2603.02547】CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Link: https://arxiv.org/abs/2603.02547

Authors: Junzhe Shen,Jieru Zhao,Ziwei He,Zhouhan Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: continuous generative dynamics, appealing continuous generative, diffusion language models, language models, generative dynamics

Notes:

Click to view abstract

Abstract:We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade off.

50. 【2603.02482】MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Link: https://arxiv.org/abs/2603.02482

Authors: Zhongxi Wang,Yueqian Lin,Jingyang Zhang,Hai Helen Li,Yiran Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: remain predominantly text-centric, Unified Safety Evaluation, large language models, language models remain, models remain predominantly

Notes: Submitted to ACL 2026 System Demonstration Track

Click to view abstract

Abstract:Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
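The distinction between hard and soft Attack Success Rate can be made concrete with a small sketch. It assumes each attack run ends with one judge verdict; the label strings (`"compliance"`, `"partial_compliance"`, `"refusal"`) and the function name are hypothetical, since the abstract names only the Compliance and Partial Compliance levels of its five-level taxonomy.

```python
def attack_success_rates(verdicts):
    """Return (hard_asr, soft_asr) for a list of judge verdict labels."""
    if not verdicts:
        return 0.0, 0.0
    # Hard ASR counts full compliance only.
    hard = sum(v == "compliance" for v in verdicts)
    # Soft ASR additionally counts partial compliance.
    soft = hard + sum(v == "partial_compliance" for v in verdicts)
    return hard / len(verdicts), soft / len(verdicts)
```

The gap between the two numbers is the share of runs with partial leakage, which a single binary metric would fold into either success or failure.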

51. 【2603.02464】GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

Link: https://arxiv.org/abs/2603.02464

Authors: Pouya Mehralian,Melissa Farasyn,Anne Breitbarth,Anne-Sophie Ghyselen,Hugo Van hamme

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic Speech Recognition, Automatic Speech, Speech Recognition, limited labeled data, dialect-heavy settings remains

Notes: Accepted to ICASSP 2026. 5 pages

Click to view abstract

Abstract:Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.
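One way to read the gated low-rank update described in the abstract is as a sum of rank-1 components whose weights depend on the location metadata. The notation below is an assumption made for illustration and is not taken from the paper: $W$ is a frozen feed-forward weight, $b_r$ and $a_r$ are the columns and rows of the LoRA factors, $m$ is the metadata vector, and $g(\cdot)$ is the gating MLP.

```latex
W'(m) = W + \sum_{r=1}^{R} g_r(m)\, b_r a_r^{\top},
\qquad g_r(m) \ge 0 \quad \text{for } r = 1, \dots, R
```

How non-negativity is enforced (for example, by a ReLU or softplus on the gate output) is not stated in the abstract. Under this reading, the per-component gates $g_r(m)$ are the quantities that could be plotted over geographic coordinates to give the interpretable adaptation patterns the abstract mentions.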

52. 【2603.02422】A Directed Graph Model and Experimental Framework for Design and Study of Time-Dependent Text Visualisation

Link: https://arxiv.org/abs/2603.02422

Authors: Songhai Fan,Simon Angus,Tim Dwyer,Ying Yang,Sarah Goodwin,Helen Purchase

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: textual sources makes, rapidly evolving narratives, Exponential growth, social media, world events

Notes: preprint version for TVCG submission

Click to view abstract

Abstract:Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events. Various visualisation techniques have been touted to help people to understand such discourse by exposing relationships between texts (such as news articles) as topics and themes evolve over time. Arguably, the understandability of such visualisations hinges on the assumption that people will be able to easily interpret the relationships in such visual network structures. To test this assumption, we begin by defining an abstract model of time-dependent text visualisation based on directed graph structures. From this model we distill motifs that capture the set of possible ways that texts can be linked across changes in time. We also develop a controlled synthetic text generation methodology that leverages the power of modern LLMs to create fictional, yet structured sets of time-dependent texts that fit each of our patterns. Therefore, we create a clean user study environment (n=30) for participants to identify patterns that best represent a given set of synthetic articles. We find that it is a challenging task for the user to identify and recover the predefined motif. We analyse qualitative data to map an unexpectedly rich variety of user rationales when divergences from expected interpretation occur. A deeper analysis also points to unexpected complexities inherent in the formation of synthetic datasets with LLMs that undermine the study control in some cases. Furthermore, analysis of individual decision-making in our study hints at a future where text discourse visualisation may need to dispense with a one-size-fits-all approach and, instead, should be more adaptable to the specific user who is exploring the visualisation in front of them.

53. 【2603.02368】RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks

Link: https://arxiv.org/abs/2603.02368

Authors: Alexandra Diaconu,Mădălina Vînaga,Bogdan Alexe

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: benchmark Romanian speech, Romanian speech dataset, automatic speech recognition, speech dataset designed, benchmark Romanian

Notes:

Click to view abstract

Abstract:We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.

54. 【2603.02353】Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs

Link: https://arxiv.org/abs/2603.02353

Authors: Jiangang Hao

Categories: Computation and Language (cs.CL)

Keywords: fosters critical thinking, underpins effective communication, articulate complex ideas, foundational literacy skill, effective communication

Notes: 21 pages, 2 figures

Click to view abstract

Abstract:Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.

55. 【2603.02333】Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Link: https://arxiv.org/abs/2603.02333

Authors: Xiaoyu Luo,Wenrui Yu,Qiongxiu Li,Johannes Bjerva

Categories: Computation and Language (cs.CL)

Keywords: occasionally reproduce training, raising concerns, copyright liability, Autoregressive language models, training data verbatim

Notes: 21 pages, 9 figures

Click to view abstract

Abstract:Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation by setting the sampling resolution maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.

56. 【2603.02258】Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry

Link: https://arxiv.org/abs/2603.02258

Authors: Kyle Elliott Mathewson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: neural machine translation, translation models learn, models learn language-universal, machine translation models, learn language-universal conceptual

Comments: 14 figures; code and interactive toolkit available at [this https URL](https://github.com/kylemathewson/InterpretCognates)

Abstract:Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program ($\rho = 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs ($U = 42656$, $p = 1.33 \times 10^{-11}$, $d = 0.96$), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
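The per-language mean-centering step mentioned in this abstract can be illustrated with a toy sketch. The data layout (a dict mapping a language code to its concept vectors) and the function name `mean_center_per_language` are assumptions made for illustration; they are not taken from the InterpretCognates toolkit.

```python
from statistics import fmean


def mean_center_per_language(embeddings):
    """Subtract each language's mean vector from all of its concept vectors.

    `embeddings` maps language -> {concept: vector}; this layout is hypothetical.
    """
    centered = {}
    for lang, concepts in embeddings.items():
        vectors = list(concepts.values())
        dim = len(vectors[0])
        # Language centroid: the component-wise mean over all concepts.
        centroid = [fmean(v[i] for v in vectors) for i in range(dim)]
        centered[lang] = {
            concept: [x - c for x, c in zip(vec, centroid)]
            for concept, vec in concepts.items()
        }
    return centered


# Toy data: the two "languages" differ only by a constant offset.
toy = {
    "eng": {"water": [1.0, 0.0], "fire": [0.0, 1.0]},
    "spa": {"water": [11.0, 10.0], "fire": [10.0, 11.0]},
}
centered = mean_center_per_language(toy)
```

In this toy case the constant offset between the two languages disappears after centering, so the same concept lands on the same point in both.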

57. 【2603.02248】HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval

Link: https://arxiv.org/abs/2603.02248

Authors: Sungho Park, Joohyung Yun, Jongwuk Lee, Wook-Shin Han

Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: open-domain question answering, Table-text retrieval aims, support open-domain question, question answering, Table-text retrieval

Comments: 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: [this https URL](https://helios-projectpage.github.io/)

Abstract:Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming "stars," which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with significant improvements of up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.

58. 【2603.02229】Safety Training Persists Through Helpfulness Optimization in LLM Agents

Link: https://arxiv.org/abs/2603.02229

Authors: Benjamin Plaut

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: refusing harmful requests, extensively in single-step, safety typically refers, studied extensively, harmful requests

Comments: Under submission

Abstract:Safety post-training has been studied extensively in single-step "chat" settings where safety typically refers to refusing harmful requests. We study an "agentic" (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with $R^2 = 0.77$. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a "best of both worlds" strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.

59. 【2603.02227】Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat

Link: https://arxiv.org/abs/2603.02227

Authors: Keston Aquino-Michaels

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: attention entries matter, entries matter, matter during training, attention, transformer learn

Comments: 14 pages, 4 figures

Abstract:Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.

60. 【2603.02218】Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Link: https://arxiv.org/abs/2603.02218

Authors: Wei Liu, Siya Qi, Yali Du, Yulan He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: Large language models, Large language, language models, make it plausible, plateau quickly

Comments: 10 pages, 6 figures, 7 formulas

Abstract:Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.

61. 【2603.02213】A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

Link: https://arxiv.org/abs/2603.02213

Authors: Marcelo A. Montemurro, Mirko Degli Esposti

Categories: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Genomics (q-bio.GN)

Keywords: DNA display characteristic, display characteristic frequency, genomic DNA display, characteristic frequency distributions, long-range correlations extending

Abstract:Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
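As a rough illustration of what a frequency-preserving assignment can look like, the sketch below ranks positions by the value of a real-valued series and fills consecutive rank blocks with symbols, one block per symbol type sized by its count. Everything here is an assumption for illustration: the abstract does not specify the assignment rule, the function name `histogram_matched_surrogate` is hypothetical, and white Gaussian noise stands in for the fractional Gaussian noise that would carry the long-range correlations.

```python
import random
from collections import Counter


def histogram_matched_surrogate(sequence, noise):
    """Map a real-valued series onto the symbol histogram of `sequence`.

    Positions are ranked by noise value, then consecutive rank blocks are
    filled with symbols, one block per symbol type sized by its count.
    The block order (most frequent symbol first) is an arbitrary choice here.
    """
    if len(sequence) != len(noise):
        raise ValueError("sequence and noise must have equal length")
    counts = Counter(sequence)
    # Positions sorted from lowest to highest noise value.
    ranked_positions = sorted(range(len(noise)), key=noise.__getitem__)
    surrogate = [None] * len(sequence)
    cursor = 0
    for symbol, count in counts.most_common():
        for pos in ranked_positions[cursor:cursor + count]:
            surrogate[pos] = symbol
        cursor += count
    return surrogate


text = "the cat and the dog and the bird".split()
rng = random.Random(0)
# Placeholder: white Gaussian noise. A real surrogate would use fractional
# Gaussian noise here so that long-range correlations are present.
noise = [rng.gauss(0.0, 1.0) for _ in text]
surrogate = histogram_matched_surrogate(text, noise)
```

Because every symbol is placed exactly as many times as it occurs in the original, the symbol counts of the surrogate match the original by construction, whatever noise series is supplied.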

Journal reference: Physica A 683 (2026) 131227

62. 【2504.21023】Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

Link: https://arxiv.org/abs/2504.21023

Authors: Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Theta, Theta text, large language models, Param, Delta

Comments: Published as a conference paper at ICLR 2025

Abstract:The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($\Theta_\text{post}$) and base model weights ($\Theta_\text{base}$), and adding this to the updated base model ($\Theta'_\text{base}$), we define the $Param\Delta$ Model as: $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We analyzed Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate that the $Param\Delta$ Model effectively replicates traditional post-training. For example, the $Param\Delta$ Model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model's performance on average. $Param\Delta$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
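The weight arithmetic in this abstract, $\Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$, can be sketched on toy checkpoints. The dict-of-lists layout, the parameter name `layer.weight`, and the function name `param_delta` are assumptions for illustration; a real checkpoint would hold tensors and would need matching architectures.

```python
def param_delta(theta_post, theta_base, theta_base_new):
    """Return theta_post - theta_base + theta_base_new, parameter by parameter.

    Each argument maps a parameter name to a flat list of floats; all three
    checkpoints are assumed to share names and shapes.
    """
    if not (theta_post.keys() == theta_base.keys() == theta_base_new.keys()):
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name in theta_post:
        merged[name] = [
            post - base + new
            for post, base, new in zip(
                theta_post[name], theta_base[name], theta_base_new[name]
            )
        ]
    return merged


# Toy checkpoints with a single two-element "layer" (values are made up).
theta_base = {"layer.weight": [1.0, 2.0]}
theta_post = {"layer.weight": [1.5, 1.0]}      # post-training delta: [+0.5, -1.0]
theta_base_new = {"layer.weight": [3.0, 4.0]}  # updated base model
merged = param_delta(theta_post, theta_base, theta_base_new)
```

The post-training delta `[+0.5, -1.0]` is carried over to the updated base, giving `[3.5, 3.0]` in this toy case without any training step.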

63. 【2603.03096】Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

Link: https://arxiv.org/abs/2603.03096

Authors: Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: self-supervised learning structure, speech models trained, models trained, trained through self-supervised, self-supervised learning

Comments: 5 pages, 7 figures, submitted to IEEE Signal Processing Letters

Abstract:How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Finally, in synthesis experiments we show that most characteristics can be controlled by changing the corresponding dimensions. This provides a simple method to control characteristics of the output voice in synthesis applications.

Information Retrieval

1. 【2603.03126】The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

Link: https://arxiv.org/abs/2603.03126

Authors: Jonas Wilinski

Categories: Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Science Data Lake, Scholarly data, largely fragmented, fragmented across siloed, divergent metadata

Comments: 18 pages, 8 figures, 7 tables. Dataset DOI: [https://doi.org/10.57967/hf/7850](https://doi.org/10.57967/hf/7850) . Code: [this https URL](https://github.com/J0nasW/science-datalake)

Abstract:Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ - $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.

2. 【2603.03094】Proactive Guiding Strategy for Item-side Fairness in Interactive Recommendation

Link: https://arxiv.org/abs/2603.03094

Authors: Chongjun Xia, Xiaoyu Shi, Hong Xie, Xianzhi Wang, yun lu, Mingsheng Shang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Item-side fairness, interactive recommender systems, long-tail items, recommender systems, fairness is crucial

Abstract:Item-side fairness is crucial for ensuring the fair exposure of long-tail items in interactive recommender systems. Existing approaches promote the exposure of long-tail items by directly incorporating them into recommended results. This causes misalignment between user preferences and the recommended long-tail items, which hinders long-term user engagement and reduces the effectiveness of recommendations. We aim for a proactive fairness-guiding strategy, which actively guides user preferences toward long-tail items while preserving user satisfaction during the interactive recommendation process. To this end, we propose HRL4PFG, an interactive recommendation framework that leverages hierarchical reinforcement learning to guide user preferences toward long-tail items progressively. HRL4PFG operates through a macro-level process that generates fairness-guided targets based on multi-step feedback, and a micro-level process that fine-tunes recommendations in real time according to both these targets and evolving user preferences. Extensive experiments show that HRL4PFG improves cumulative interaction rewards and maximum user interaction length by a larger margin when compared with state-of-the-art methods in interactive recommendation environments.

3. 【2603.03010】Reproducing and Comparing Distillation Techniques for Cross-Encoders

Link: https://arxiv.org/abs/2603.03010

Authors: Victor Morand, Mathias Vast, Basile Van Cooten, Laure Soulier, Josiane Mothe, Benjamin Piwowarski

Categories: Information Retrieval (cs.IR)

Keywords: Information Retrieval, established transformer-based cross-encoders, advances in Information, Retrieval have established, established transformer-based

Abstract:Recent advances in Information Retrieval have established transformer-based cross-encoders as a keystone in IR. Recent studies have focused on knowledge distillation and showed that, with the right strategy, traditional cross-encoders could reach the level of effectiveness of LLM re-rankers. Yet, comparisons with previous training strategies, including distillation from strong cross-encoder teachers, remain unclear. In addition, few studies cover a similar range of backbone encoders, while substantial improvements have been made in this area since BERT. This lack of comprehensive studies in controlled environments makes it difficult to identify robust design choices. In this work, we reproduce the LLM-based distillation strategy of Schlatt et al. (2025) and compare it to the approach of Hofstätter et al. (2020), based on an ensemble of cross-encoder teachers, as well as other supervised objectives, to fine-tune a large range of cross-encoders, from the original BERT and its follow-ups RoBERTa, ELECTRA and DeBERTa-v3, to the more recent ModernBERT. We evaluate all models on both in-domain (TREC-DL and MS MARCO dev) and out-of-domain datasets (BEIR, LoTTE, and Robust04). Our results show that objectives emphasizing relative comparisons -- pairwise MarginMSE and listwise InfoNCE -- consistently outperform pointwise baselines across all backbones and evaluation settings, and that objective choice can yield gains comparable to scaling the backbone architecture.
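For readers unfamiliar with the two objectives named in this abstract, the sketch below gives their textbook forms for a single query: MarginMSE compares the student's score margin with the teacher's, and InfoNCE is the negative log-softmax probability of the relevant passage within a candidate list. The function names and the toy scores are assumptions, and the exact variants used in the paper (batching, temperature, negative sampling) may differ.

```python
import math


def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Pairwise MarginMSE: squared error between student and teacher score margins."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2


def info_nce(scores, positive_index=0):
    """Listwise InfoNCE: negative log-softmax probability of the positive passage."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[positive_index]


# Made-up scores for one query: a relevant passage and non-relevant ones.
loss_pair = margin_mse(student_pos=2.0, student_neg=0.5, teacher_pos=4.0, teacher_neg=1.0)
loss_list = info_nce([2.0, 0.5, -1.0], positive_index=0)
```

Both losses depend only on score differences within a query, which is the sense in which they emphasize relative comparisons rather than absolute, pointwise targets.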

4. 【2603.02999】OneRanker: Unified Generation and Ranking with One Model in Industrial Advertising Recommendation

Link: https://arxiv.org/abs/2603.02999

Authors: Dekai Sun, Yiming Liu, Jiafan Zhou, Xun Liu, Chenchen Yu, Yi Li, Huan Yu, Jun Zhang

Categories: Information Retrieval (cs.IR)

Keywords: traditional cascaded architectures, driving a shift, unified modeling, shift from traditional, traditional cascaded

Abstract:The end-to-end generative paradigm is revolutionizing advertising recommendation systems, driving a shift from traditional cascaded architectures towards unified modeling. However, practical deployment faces three core challenges: the misalignment between interest objectives and business value, the target-agnostic limitation of generative processes, and the disconnection between generation and ranking stages. Existing solutions often fall into a dilemma where single-stage fusion induces optimization tension, while stage decoupling causes irreversible information loss. To address this, we propose OneRanker, achieving architectural-level deep integration of generation and ranking. First, we design a value-aware multi-task decoupling architecture. By leveraging task token sequences and causal mask, we separate interest coverage and value optimization spaces within shared representations, effectively alleviating target conflicts. Second, we construct a coarse-to-fine collaborative target awareness mechanism, utilizing Fake Item Tokens for implicit awareness during generation and a ranking decoder for explicit value alignment at the candidate level. Finally, we propose input-output dual-side consistency guarantees. Through Key/Value pass-through mechanisms and Distribution Consistency (DC) Constraint Loss, we achieve end-to-end collaborative optimization between generation and ranking. The full deployment on Tencent's WeiXin channels advertising system has shown a significant improvement in key business metrics (GMV - Normal +1.34%), providing a new paradigm with industrial feasibility for generative advertising recommendations.

5. 【2603.02941】Timehash: Hierarchical Time Indexing for Efficient Business Hours Search

Link: https://arxiv.org/abs/2603.02941

Authors: Jinoh Kim, Jaewon Son

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: operating hours, Temporal range filtering, critical operation, operation in large-scale, filter businesses

Comments: 12 pages, 2 figures, 8 tables. Submitted to VLDB 2026 Industry Track

Abstract:Temporal range filtering is a critical operation in large-scale search systems, particularly for location-based services that need to filter businesses by operating hours. Traditional approaches either suffer from poor query performance (scope filtering) or index size explosion (minute-level indexing). We present Timehash, a novel hierarchical time indexing algorithm that achieves over 99% reduction in index size compared to minute-level indexing while maintaining 100% precision. Timehash employs a flexible multi-resolution strategy with customizable hierarchical levels. Through empirical analysis on distributions from 12.6 million business records of a production location search service, we demonstrate a data-driven methodology for selecting optimal hierarchies tailored to specific data distributions. We evaluated Timehash on up to 12.6 million synthetic POIs generated from production distributions. Experimental results show that a five-level hierarchy reduces index terms to 5.6 per document (99.1% reduction versus minute-level indexing), with zero false positives and zero false negatives. Scalability benchmarks confirm constant per-document cost from 100K to 12.6M POIs, while supporting complex scenarios such as break times and irregular schedules. Our approach is generalizable to various temporal filtering problems in search systems, e-commerce, and reservation platforms.
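The general idea of multi-resolution time indexing can be sketched as follows. This is a generic hierarchical interval cover written for illustration, not the Timehash algorithm itself, whose encoding the abstract does not describe. The level sizes in `LEVELS` and all function names are assumptions.

```python
LEVELS = (240, 60, 15, 5, 1)  # block sizes in minutes; illustrative, not from the paper


def cover_interval(start, end, levels=LEVELS):
    """Cover [start, end) in minutes with aligned blocks, largest first.

    Returns index terms as (block_size, block_start) tuples.
    """
    if 1 not in levels:
        raise ValueError("the finest level must be 1 minute")
    terms = []
    t = start
    while t < end:
        for size in sorted(levels, reverse=True):
            # Use the largest block that starts exactly at t and fits inside.
            if t % size == 0 and t + size <= end:
                terms.append((size, t))
                t += size
                break
    return terms


def query_terms(minute, levels=LEVELS):
    """The aligned block at each level that contains the queried minute."""
    return {(size, minute - minute % size) for size in levels}


def is_open(index_terms, minute, levels=LEVELS):
    """A business matches if any of its index terms contains the minute."""
    return not query_terms(minute, levels).isdisjoint(index_terms)


# A business open 09:00-18:00, expressed as minutes since midnight.
terms = cover_interval(9 * 60, 18 * 60)
```

In this toy case six index terms replace the 540 that minute-level indexing would need. The match stays exact because every emitted block lies entirely inside the opening interval and the blocks together cover all of it.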

ACM classes: H.3.1; H.3.3

6. 【2603.02773】Model Editing for New Document Integration in Generative Information Retrieval

Link: https://arxiv.org/abs/2603.02773

Authors: Zhen Zhang, Zihan Wang, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Xin Xin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren

Categories: Information Retrieval (cs.IR)

Keywords: reformulates the Information, Generative retrieval, Information Retrieval, newly added documents, Generative

Comments: Accepted to The Web Conference (WWW) 2026

Abstract:Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Despite its promise, existing GR models exhibit poor generalization to newly added documents, often failing to generate the correct docIDs. While incremental training offers a straightforward remedy, it is computationally expensive, resource-intensive, and prone to catastrophic forgetting, thereby limiting the scalability and practicality of GR. In this paper, we identify the core bottleneck as the decoder's ability to map hidden states to the correct docIDs of newly added documents. Model editing, which enables targeted parameter modifications for docID mapping, represents a promising solution. However, applying model editing to current GR models is not trivial, which is severely hindered by indistinguishable edit vectors across queries, due to the high overlap of shared docIDs in retrieval results. To address this, we propose DOME (docID-oriented model editing), a novel method that effectively and efficiently adapts GR models to unseen documents. DOME comprises three stages: (1) identification of critical layers, (2) optimization of edit vectors, and (3) construction and application of updates. At its core, DOME employs a hybrid-label adaptive training strategy that learns discriminative edit vectors by combining soft labels, which preserve query-specific semantics for distinguishable updates, with hard labels that enforce precise mapping modifications. Experiments on widely used benchmarks, including NQ and MS MARCO, show that our method significantly improves retrieval performance on new documents while maintaining effectiveness on the original collection. Moreover, DOME achieves this with only about 60% of the training time required by incremental training, considerably reducing computational cost and enabling efficient, frequent model updates.

7. 【2603.02730】APAO: Adaptive Prefix-Aware Optimization for Generative Recommendation

Link: https://arxiv.org/abs/2603.02730

Authors: Yuanqing Yu, Yifan Wang, Weizhi Ma, Zhiqiang Guo, Min Zhang

Categories: Information Retrieval (cs.IR)

Keywords: Generative recommendation, recently emerged, promising paradigm, paradigm in sequential, beam search

Abstract:Generative recommendation has recently emerged as a promising paradigm in sequential recommendation. It formulates the task as an autoregressive generation process, predicting discrete tokens of the next item conditioned on user interaction histories. Existing generative recommendation models are typically trained with token-level likelihood objectives, such as cross-entropy loss, while employing multi-step beam search during inference to generate ranked item candidates. However, this leads to a fundamental training-inference inconsistency: standard training assumes ground-truth history is always available, ignoring the fact that beam search prunes low-probability branches during inference. Consequently, the correct item may be prematurely discarded simply because its initial tokens (prefixes) have low scores. To address this issue, we propose the Adaptive Prefix-Aware Optimization (APAO) framework, which introduces prefix-level optimization losses to better align the training objective with the inference setting. Furthermore, we design an adaptive worst-prefix optimization strategy that dynamically focuses on the most vulnerable prefixes during training, thereby enhancing the model's ability to retain correct candidates under beam search constraints. We provide theoretical analyses to demonstrate the effectiveness and efficiency of our framework. Extensive experiments on multiple datasets further show that APAO consistently alleviates the training-inference inconsistency and improves performance across various generative recommendation backbones. Our codes are publicly available at this https URL.

8. 【2603.02725】S2CDR: Smoothing-Sharpening Process Model for Cross-Domain Recommendation

Link: https://arxiv.org/abs/2603.02725

Authors: Xiaodong Li, Juwei Yue, Xinghua Zhang, Jiawei Sheng, Wenyuan Zhang, Taoyu Su, Zefeng Zhang, Tingwen Liu

Categories: Information Retrieval (cs.IR)

Keywords: User cold-start problem, user cold-start challenge, recommendation systems, long-standing challenge, User cold-start

Comments: This paper is accepted by WWW'2026

Abstract:The user cold-start problem is a long-standing challenge in recommendation systems. Cross-domain recommendation (CDR) has emerged as a highly effective remedy for this challenge, with recently developed diffusion models (DMs) demonstrating exceptional performance. However, these DM-based CDR methods focus on user-item interactions and overlook correlations between items across the source and target domains. Meanwhile, the Gaussian noise added in the forward process of diffusion models can distort users' personalized preferences, making it difficult to transfer user preferences across domains. To this end, we propose a novel Smoothing-Sharpening Process Model for CDR for cold-start users, termed S2CDR, which features a corruption-recovery architecture and is solved with respect to ordinary differential equations (ODEs). Specifically, the smoothing process gradually corrupts the original user-item/item-item interaction matrices derived from both domains into smoothed preference signals in a noise-free manner, and the sharpening process iteratively sharpens the preference signals to recover the unknown interactions for cold-start users. For the smoothing process, we introduce the heat equation on the item-item similarity graph to better capture the correlations between items across domains, and further build a tailor-designed low-pass filter to filter out high-frequency noise and capture users' intrinsic preferences, in accordance with graph signal processing (GSP) theory. Extensive experiments on three real-world CDR scenarios confirm that our S2CDR significantly outperforms previous SOTA methods in a training-free manner.

9. 【2603.02653】AlphaFree: Recommendation Free from Users, IDs, and GNNs

Link: https://arxiv.org/abs/2603.02653

Authors: Minseo Jeon,Junwoo Jung,Daewon Gwak,Jinhong Jung

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: effective recommender systems, design effective recommender, recommender systems free, recommender systems, effective recommender

Comments: 13 pages, The Web Conference (WWW) 2026

Click to view abstract

Abstract:Can we design effective recommender systems free from users, IDs, and GNNs? Recommender systems are central to personalized content delivery across domains, with top-K item recommendation being a fundamental task to retrieve the most relevant items from historical interactions. Existing methods rely on entrenched design conventions, often adopted without reconsideration, such as storing per-user embeddings (user-dependent), initializing features from raw IDs (ID-dependent), and employing graph neural networks (GNN-dependent). These dependencies incur several limitations, including high memory costs, cold-start and over-smoothing issues, and poor generalization to unseen interactions. In this work, we propose AlphaFree, a novel recommendation method free from users, IDs, and GNNs. Our main ideas are to infer preferences on-the-fly without user embeddings (user-free), replace raw IDs with language representations (LRs) from pre-trained language models (ID-free), and capture collaborative signals through augmentation with similar items and contrastive learning, without GNNs (GNN-free). Extensive experiments on various real-world datasets show that AlphaFree consistently outperforms its competitors, achieving up to around 40% improvements over non-LR-based methods and up to 5.7% improvements over LR-based methods, while significantly reducing GPU memory usage by up to 69% under high-dimensional LRs.

Related DOI:

https://doi.org/10.1145/3774904.3792355


10. 【2603.02565】FlashEvaluator: Expanding Search Space with Parallel Evaluation

Link: https://arxiv.org/abs/2603.02565

Authors: Chao Feng,Yuanhao Pu,Chenghao Zhang,Shanqi Liu,Shuchang Liu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Natural Language Processing, Language Processing, Natural Language, generator and selecting, selecting the top-ranked

Comments: 23 pages, 2 figures

Click to view abstract

11. 【2603.02561】SOLAR: SVD-Optimized Lifelong Attention for Recommendation

Link: https://arxiv.org/abs/2603.02561

Authors: Chenghao Zhang,Chao Feng,Yuanhao Pu,Xunyong Yang,Wenhui Yu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: global credit assignment, expressive global credit, makes long-context modeling, operator in Transformers, long-context modeling expensive

Comments: 18 pages, 4 figures

Click to view abstract

Abstract:Attention mechanism remains the defining operator in Transformers since it provides expressive global credit assignment, yet its $O(N^2 d)$ time and memory cost in sequence length $N$ makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to $O(N d^2)$ by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case, but rather the default inductive bias in its representation learning, particularly explicit in the user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$. With SVD-Attention, we propose SOLAR, SVD-Optimized Lifelong Attention for Recommendation, a sequence modeling framework that supports behavior sequences of ten-thousand scale and candidate sets of several thousand items in cascading process without any filtering. In Kuaishou's online recommendation scenario, SOLAR delivers a 0.68% Video Views gain together with additional business metrics improvements.
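
The $O(N^2 d)$ versus $O(N d^2)$ contrast in this abstract comes from matrix-multiplication associativity: once softmax is dropped, $(QK^\top)V = Q(K^\top V)$, and the right-hand grouping never forms an $N \times N$ matrix. The sketch below only illustrates that reordering with an identity feature map. It does not implement SVD-Attention, and the toy matrices and function names are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def quadratic_order(q, k, v):
    """(Q K^T) V: materialises an N x N score matrix, O(N^2 d) work."""
    return matmul(matmul(q, transpose(k)), v)

def linear_order(q, k, v):
    """Q (K^T V): only a d x d intermediate, O(N d^2) work."""
    return matmul(q, matmul(transpose(k), v))

# Toy inputs with N = 3 tokens and d = 2 features (values are arbitrary).
Q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
K = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
V = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]
```

Both groupings return the same result here, which is why the saving is free without softmax. Applying softmax row-wise to $QK^\top$ breaks the equality, and that is the gap the abstract says SVD-Attention addresses.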

12. 【2603.02555】Relevance Matters: A Multi-Task and Multi-Stage Large Language Model Approach for E-commerce Query Rewriting

Link: https://arxiv.org/abs/2603.02555

Authors: Aijun Dai,Jixiang Zhang,Haiqing Hu,Guoyu Tang,Lin Liu,Ziguang Cheng

Categories: Information Retrieval (cs.IR)

Keywords: users' behavioral responses, returned products, click-through rate, query rewriting, experience is measured

Comments: Accepted for publication at ICDE 2026

Click to view abstract

Abstract:For e-commerce search, user experience is measured by users' behavioral responses to returned products, like click-through rate and conversion rate, as well as the relevance between returned products and search queries. Consequently, relevance and user conversion constitute the two primary objectives in query rewriting, a strategy to bridge the lexical gap between user expressions and product descriptions. This research proposes a multi-task and multi-stage query rewriting framework grounded in large language models (LLMs). Critically, in contrast to previous works that primarily emphasized rewritten query generation, we inject the relevance task into query rewriting. Specifically, leveraging a pretrained model on user data and product information from this http URL, the approach initiates with multi-task supervised fine-tuning (SFT) comprising of the rewritten query generation task and the relevance tagging task between queries and rewrites. Subsequently, we employ Group Relative Policy Optimization (GRPO) for the model's objective alignment oriented toward enhancing the relevance and stimulating user conversions. Through offline evaluation and online A/B test, our framework illustrates substantial improvements in the effectiveness of e-commerce query rewriting, resulting in elevating the search results' relevance and boosting the number of purchases made per user (UCVR). Since August 2025, our approach has been implemented on this http URL, one of China's leading online shopping platforms.

13. 【2603.02519】Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling

Link: https://arxiv.org/abs/2603.02519

Authors: Wei Jiang,Tong Chen,Wei Yuan,Quoc Viet Hung Nguyen,Hongzhi Yin

Categories: Multimedia (cs.MM); Information Retrieval (cs.IR)

Keywords: detecting multi-modal misinformation, Vision-language models, social platforms, delayed annotations, multi-modal misinformation

Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.
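
The Best-of-N mechanism with critic scoring and adaptive stopping described in this abstract can be sketched generically as follows. This is an illustration of the general pattern, not AgentM3D's procedure: `generate`, `critic`, `max_paths`, and `stop_score` are hypothetical names, and the candidate answers and scores are made up.

```python
from typing import Callable, Tuple

def best_of_n(generate: Callable[[int], str],
              critic: Callable[[str], float],
              max_paths: int,
              stop_score: float = 0.9) -> Tuple[str, float, int]:
    """Sample up to max_paths candidates, keep the highest-scoring one,
    and stop early once a candidate reaches stop_score."""
    best_answer, best_score, used = "", float("-inf"), 0
    for i in range(max_paths):
        answer = generate(i)
        score = critic(answer)
        used = i + 1
        if score > best_score:
            best_answer, best_score = answer, score
        if best_score >= stop_score:
            break  # adaptive stopping: no need to explore further paths
    return best_answer, best_score, used

# Stand-ins for a VLM agent and a critic agent (purely illustrative).
CANDIDATES = ["real", "fake: image-text mismatch", "fake"]
SCORES = {"real": 0.2, "fake: image-text mismatch": 0.95, "fake": 0.6}
result = best_of_n(lambda i: CANDIDATES[i], SCORES.__getitem__, max_paths=3)
```

In this toy run the loop stops after the second candidate because its score clears the threshold, so the third reasoning path is never generated.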

14. 【2603.02248】HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval

Link: https://arxiv.org/abs/2603.02248

Authors: Sungho Park,Joohyung Yun,Jongwuk Lee,Wook-Shin Han

Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: open-domain question answering, Table-text retrieval aims, support open-domain question, question answering, Table-text retrieval

Comments: 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: [this https URL](https://helios-projectpage.github.io/)

Click to view abstract

Computer Vision

1. 【2603.03283】Utonia: Toward One Encoder for All Point Clouds

Link: https://arxiv.org/abs/2603.03283

Authors: Yujia Zhang,Xiaoyang Wu,Yunhan Yang,Xianzhe Fan,Han Li,Yuechen Zhang,Zehao Huang,Naiyan Wang,Hengshuang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point clouds, indoor RGB-D sequences, single self-supervised point, point clouds lifted, shape a single

Comments: produced by Pointcept, project page: [this https URL](https://pointcept.github.io/Utonia)

Click to view abstract

Abstract:We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.

2. 【2603.03282】MIBURI: Towards Expressive Interactive Gesture Synthesis

Link: https://arxiv.org/abs/2603.03282

Authors: M. Hamza Mughal,Rishabh Dabral,Vera Demberg,Christian Theobalt

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)

Keywords: Embodied Conversational Agents, Embodied Conversational, based conversational agents, Conversational Agents, conversational agents lack

Comments: CVPR 2026. Project page: [this https URL](https://vcai.mpi-inf.mpg.de/projects/MIBURI/)

Click to view abstract

Abstract:Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions, that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on this https URL.

3. 【2603.03281】CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Link: https://arxiv.org/abs/2603.03281

Authors: Hanyang Wang,Yiyang Liu,Jiawei Chi,Fangfu Liu,Ran Xue,Yueqi Duan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: central approach, approach for enhancing, CFG, Mode Control CFG, flow-based diffusion models

Comments: Accepted by CVPR 2026; Project Page: [this https URL](https://hanyang-21.github.io/CFG-Ctrl)

Click to view abstract

Abstract:Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: this https URL
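
The proportional-controller reading of vanilla CFG in this abstract matches one common way of writing the guidance rule: $v = v_u + w\,(v_c - v_u)$, where the conditional-unconditional discrepancy is the error signal and the guidance scale $w$ is the fixed gain. The sketch below shows only that standard rule. It does not implement SMC-CFG, and the function name and toy vectors are assumptions.

```python
from typing import List

def cfg_velocity(v_cond: List[float], v_uncond: List[float], gain: float) -> List[float]:
    """Vanilla CFG: start from the unconditional prediction and add the
    conditional-unconditional discrepancy scaled by a fixed gain."""
    error = [c - u for c, u in zip(v_cond, v_uncond)]
    return [u + gain * e for u, e in zip(v_uncond, error)]

# Toy velocity predictions for a 3-dimensional latent (arbitrary values).
v_c = [1.0, -2.0, 0.5]
v_u = [0.0, -1.0, 0.5]
guided = cfg_velocity(v_c, v_u, gain=3.0)
```

A gain of 0 returns the unconditional prediction and a gain of 1 returns the conditional one. Larger gains extrapolate linearly past the conditional prediction, which is the regime where the abstract reports instability and overshooting.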

4. 【2603.03280】How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Link: https://arxiv.org/abs/2603.03280

Authors: Toru Lin,Shuying Deng,Zhao-Heng Yin,Pieter Abbeel,Jitendra Malik

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)

Keywords: essential manipulation tasks, food preparation, remain intractable, autonomous robots, essential manipulation

Comments: Project page can be found at [this https URL](https://toruowo.github.io/peel)

Click to view abstract

Abstract:Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.

5. 【2603.03279】ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Link: https://arxiv.org/abs/2603.03279

Authors: Xialin He,Sirui Xu,Xinyao Li,Runpei Dong,Liuyu Bian,Yu-Xiong Wang,Liang-Yan Gui

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: making humanoids practically, remains a central, central barrier, barrier to making, Achieving autonomous

Comments: Project Page: [this https URL](https://ultra-humanoid.github.io/)

Click to view abstract

Abstract:Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

6. 【2603.03278】Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping

Link: https://arxiv.org/abs/2603.03278

Authors: William Liang,Sam Wang,Hung-Ju Wang,Osbert Bastani,Yecheng Jason Ma,Dinesh Jayaraman

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering a scalable, ability to conduct, conduct and learn, scalable alternative, alternative to labor-intensive

Comments: International Conference on Learning Representations (ICLR), 2026. Project website and code: [this https URL](https://tether-research.github.io)

Click to view abstract

Abstract:The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (≤10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.

7. 【2603.03276】Beyond Language Modeling: An Exploration of Multimodal Pretraining

Link: https://arxiv.org/abs/2603.03276

Authors: Shengbang Tong,David Fan,John Nguyen,Ellis Brown,Gaoyue Zhou,Shengyi Qian,Boyang Zheng,Théophane Vallaeys,Junlin Han,Rob Fergus,Naila Murray,Marjan Ghazvininejad,Mike Lewis,Nicolas Ballas,Amir Bar,Michael Rabbat,Jakob Verbeek,Luke Zettlemoyer,Koustuv Sinha,Yann LeCun,Saining Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing foundation models, offers a critical, critical axis, axis for advancing, advancing foundation

Comments: Project website at [this https URL](https://beyond-llms.github.io/)

Click to view abstract

Abstract:The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.

8. 【2603.03269】LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Link: https://arxiv.org/abs/2603.03269

Authors: Junyi Zhang,Charles Herrmann,Junhwa Hur,Chen Sun,Ming-Hsuan Yang,Forrester Cole,Trevor Darrell,Deqing Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: geometric foundation models, Long-context Geometric Reconstruction, Feedforward geometric foundation, quadratic attention complexity, limited effective memory

Comments: Project page: [this https URL](https://LoGeR-project.github.io/)

Click to view abstract

Abstract:Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.

9. 【2603.03265】DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

Link: https://arxiv.org/abs/2603.03265

Authors: Yufu Wang,Evonne Ng,Soyong Shin,Rawal Khirodkar,Yuan Dong,Zhaoen Su,Jinhyung Park,Kris Kitani,Alexander Richard,Fabian Prada,Michael Zollhofer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recovers human motion, recovers human, motion, unconstrained videos, incomplete observations

Comments: CVPR 2026. Project page: [this https URL](https://yufu-wang.github.io/duomo/)

Click to view abstract

Abstract:We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: this https URL

10. 【2603.03241】UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Link: https://arxiv.org/abs/2603.03241

Authors: Zimo Wen,Boxiu Li,Wanbo Zhang,Junxiang Lei,Xiaoyu Chen,Yijia Fan,Qi Zhang,Yujiang Wang,Lili Qiu,Bo Li,Ziwei Liu,Caihua Shan,Yifan Yang,Yifei Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: strong generative capabilities, recently demonstrated strong, demonstrated strong generative, understanding remains unclear, improves understanding remains

Comments:

Click to view abstract

Abstract:Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

11. 【2603.03239】COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Link: https://arxiv.org/abs/2603.03239

Authors: Miguel Espinosa,Eva Gmelich Meijling,Valerio Marsocci,Elliot J. Crowley,Mikolaj Czerkawski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: applications increasingly rely, observation applications increasingly, Earth observation applications, Earth observation, land-cover products

Comments:

Click to view abstract

Abstract:Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https:// this http URL

12. 【2603.03198】ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments

Link: https://arxiv.org/abs/2603.03198

Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demands robust generalization, unmanned aerial vehicles, intelligence demands robust, demands robust, unmanned aerial

Comments: Code: [this https URL](https://github.com/ACE-BRAIN-Team/ACE-Brain-0) Hugging Face: [this https URL](https://huggingface.co/ACE-Brain/ACE-Brain-0-8B)

Click to view abstract

13. 【2603.03197】Specificity-aware reinforcement learning for fine-grained open-world classification

链接：https://arxiv.org/abs/2603.03197

作者：Samuele Angheben,Davide Berasi,Alessandro Conti,Elisa Ricci,Yiming Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：predefined label set, Classifying fine-grained visual, Large Multimodal Models, Classifying fine-grained, reasoning Large Multimodal

备注： Accepted at CVPR 2026

Abstract:Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at this https URL.

14. 【2603.03195】Chain of World: World Model Thinking in Latent Motion

Link: https://arxiv.org/abs/2603.03195

Authors: Fuxiang Yang,Donglin Di,Lulu Tang,Xuancheng Zhang,Lei Fan,Hao Li,Chen Wei,Tonghua Su,Baorui Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: underlying visual dynamics, temporal-causal structure underlying, structure underlying visual, embodied intelligence, promising path

Comments: Accepted by CVPR2026. Project page: [this https URL](https://fx-hit.github.io/cowvla-io/)

Abstract:Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at this https URL.

15. 【2603.03192】MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Link: https://arxiv.org/abs/2603.03192

Authors: Ashutosh Chaubey,Jiacheng Pang,Mohammad Soleymani

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Omni-modal large language, recently achieved strong, achieved strong performance, remain highly susceptible, dominant language priors

Comments: CVPR 2026. Project Page: [this https URL](https://mod-dpo.github.io/)

16. 【2603.03187】ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection

Link: https://arxiv.org/abs/2603.03187

Authors: Chun-Wun Cheng,Yanqi Cheng,Peiyuan Jing,Guang Yang,Carola-Bibiane Schönlieb,Angelica I. Aviles-Rivero

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: U-shaped encoder-decoder architectures, Medical image segmentation, connections preserve fine, preserve fine spatial, fine spatial detail

Comments:

Abstract:Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering, an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an $\ell_1$ proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains ($\approx 20\%$) on difficult 3D segmentation tasks. Project page: this https URL
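
The soft-thresholding gate mentioned in the abstract is the closed-form proximal operator of the $\ell_1$ norm. A minimal sketch follows, assuming fixed per-channel thresholds in place of learned ones; the function names and the toy values are hypothetical and the code is not the authors' implementation.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0  # responses with magnitude <= lam are removed outright

def sparse_gate(channels, thresholds):
    """Apply one threshold per channel; each channel is a flat list of floats."""
    return [
        [soft_threshold(v, lam) for v in channel]
        for channel, lam in zip(channels, thresholds)
    ]

# Hypothetical skip features: two channels with four responses each.
features = [[1.5, -2.0, 0.25, 0.0], [0.5, -0.5, 3.0, -0.125]]
gated = sparse_gate(features, thresholds=[0.5, 1.0])
print(gated)
```

Unlike a sigmoid mask, which only rescales every response, this gate sets small responses exactly to zero.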

17. 【2603.03163】Conditioned Activation Transport for T2I Safety Steering

Link: https://arxiv.org/abs/2603.03163

Authors: Maciej Chrabąszcz,Aleksander Szymczyk,Jan Dubiński,Tomasz Trzciński,Franziska Boenisch,Adam Dziedzic

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: models remain prone, impressive capabilities, models remain, toxic content, remain prone

Comments:

Abstract:Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.

18. 【2603.03160】Kling-MotionControl Technical Report

Link: https://arxiv.org/abs/2603.03160

Authors: Kling Team:Jialu Chen,Yikang Ding,Zhixue Fang,Kun Gai,Kang He,Xu He,Jingyun Hua,Mingming Lao,Xiaohan Li,Hui Liu,Jiwen Liu,Xiaoqiang Liu,Fan Shi,Xiaoyu Shi,Peiqin Sun,Songlin Tang,Pengfei Wan,Tiancheng Wen,Zhiyong Wu,Haoxian Zhang,Runze Zhao,Yuanxing Zhang,Yan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate lifelike videos, Character animation aims, transferring motion dynamics, Character animation, driving video

Comments: Access: [this https URL](https://app.klingai.com/global/video-motion-control/new)

Abstract:Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.

19. 【2603.03143】Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Link: https://arxiv.org/abs/2603.03143

Authors: Jiyuan Wang,Chunyu Lin,Lei Sun,Zhi Cao,Yuyang Yin,Lang Nie,Zhenlong Yuan,Xiangxiang Chu,Yunchao Wei,Kang Liao,Guosheng Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: promising paradigm, Leveraging, maintaining multi-view consistency, Leveraging the priors, diffusion models

Comments: 18 pages, 8 figures

Abstract:Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.

20. 【2603.03125】AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis

Link: https://arxiv.org/abs/2603.03125

Authors: Maryam Heidari(1),Nantheera Anantrasirichai(1),Steven Walker(2),Rahul Bhatnagar(2),Alin Achim(1) ((1) University of Bristol, UK, (2) Bristol Medical School, University of Bristol, UK)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: portable imaging modality, Lung ultrasound, machine learning methods, Generative Adversarial Networks, imaging modality

Comments: 5 pages, 4 figures. Accepted to ICASSP 2026

Abstract:Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision language foundation model trained on large scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
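
For readers unfamiliar with the a trous ("with holes") wavelet transform, here is a minimal 1D sketch. It assumes the commonly used B3-spline kernel and clamped borders, neither of which is stated in the abstract, and it covers only the transform, not the diffusion model. The point it illustrates is that every scale keeps the full input length, so no downsampling takes place, and the planes sum back to the input.

```python
# B3-spline smoothing kernel (an assumed, commonly used choice).
B3_KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]

def smooth(signal, level):
    """Smooth with the kernel taps spaced 2**level samples apart (the 'holes')."""
    step = 2 ** level
    n = len(signal)
    out = []
    for k in range(n):
        acc = 0.0
        for m, h in zip(range(-2, 3), B3_KERNEL):
            idx = min(max(k + m * step, 0), n - 1)  # clamp indices at the borders
            acc += h * signal[idx]
        out.append(acc)
    return out

def a_trous(signal, levels):
    """Return (wavelet planes, coarse residual); all have the input length."""
    planes = []
    current = list(signal)
    for level in range(levels):
        smoothed = smooth(current, level)
        planes.append([c - s for c, s in zip(current, smoothed)])  # detail lost by smoothing
        current = smoothed
    return planes, current

def reconstruct(planes, residual):
    """Add the detail planes back onto the coarse residual."""
    out = list(residual)
    for plane in planes:
        out = [o + w for o, w in zip(out, plane)]
    return out

signal = [0.0, 0.0, 1.0, 5.0, 2.0, 0.0, 0.0, 3.0]
planes, residual = a_trous(signal, levels=3)
restored = reconstruct(planes, residual)
print(len(planes), [len(p) for p in planes])
```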

21. 【2603.03101】MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

Link: https://arxiv.org/abs/2603.03101

Authors: Jun Yeong Park,JunYoung Seo,Minji Kang,Yu Rang Park

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Zero-Shot Anomaly Detection, driven recent success, CLIP model outstanding, anomaly detection tasks, model outstanding generalization

Comments: Accepted by CVPR 2026

Abstract:The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at this https URL.
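
A simplex equiangular tight frame for K vectors is a set of unit vectors whose pairwise cosine similarity is -1/(K-1), the most spread-out arrangement possible. The sketch below builds one with the standard construction and scores how far a set of vectors deviates from it. The squared-deviation loss is an illustrative assumption, not the loss defined in the paper.

```python
import math

def simplex_etf(k):
    """Rows of sqrt(k/(k-1)) * (I - 11^T / k): k unit vectors, equal pairwise angles."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def etf_loss(vectors):
    """Hypothetical loss: mean squared gap between pairwise cosines and -1/(k-1)."""
    k = len(vectors)
    target = -1.0 / (k - 1)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += (cosine(vectors[i], vectors[j]) - target) ** 2
            pairs += 1
    return total / pairs

etf = simplex_etf(4)
print(cosine(etf[0], etf[1]))   # close to -1/3
print(etf_loss(etf))            # close to 0 for a perfect simplex ETF
```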

22. 【2603.03075】TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference

Link: https://arxiv.org/abs/2603.03075

Authors: Mhd Rashed Al Koutayni,Mohamed Selim,Gerd Reis,Alain Pagani,Didier Stricker

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Keywords: Accurate sea ice, safe maritime navigation, conditions require timely, Synthetic Aperture Radar, changing ice conditions

Comments: undergoing publication at CVC 2026

Abstract:Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.

23. 【2603.03072】TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

Link: https://arxiv.org/abs/2603.03072

Authors: Christian Greisinger,Steffen Eger

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, Large language, diverse workflows, assist scientists, scientists across diverse

Comments:

24. 【2603.03066】EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education

Link: https://arxiv.org/abs/2603.03066

Authors: Baoliang Chen,Xinlong Bu,Lingyu Zhu,Hanwei Zhu,Xiangjie Sui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely untapped, achieved remarkable success, education remains largely, generating photorealistic videos, support visual

Comments:

Abstract:While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.

25. 【2603.03043】IoUCert: Robustness Verification for Anchor-based Object Detectors

Link: https://arxiv.org/abs/2603.03043

Authors: Benedikt Brückner,Alejandro J. Mercado,Yanghao Zhang,Panagiotis Kouvaros,Alessio Lomuscio

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains notoriously difficult, notoriously difficult due, detection remains notoriously, object detection remains, image classification

Comments:

Abstract:While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.
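
To give a feel for what bounding IoU under uncertainty involves, the sketch below computes naive interval bounds: given lower and upper limits on each corner of a predicted box, it returns values that no box inside those limits can fall below or exceed. This is a deliberately simple illustration, generally looser than the optimal bounds the abstract describes, and it is not the IoUCert algorithm. All names are hypothetical, boxes are `(x1, y1, x2, y2)`, and the limits are assumed to describe valid boxes.

```python
def overlap(lo_a, hi_a, lo_b, hi_b):
    """Length of the overlap between intervals [lo_a, hi_a] and [lo_b, hi_b]."""
    return max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def intersection(box, gt):
    return overlap(box[0], box[2], gt[0], gt[2]) * overlap(box[1], box[3], gt[1], gt[3])

def iou(box, gt):
    """Exact IoU of two axis-aligned boxes."""
    inter = intersection(box, gt)
    return inter / (area(box) + area(gt) - inter)

def iou_bounds(lo, hi, gt):
    """Sound but loose IoU bounds over all boxes with lo[i] <= coord[i] <= hi[i].

    Assumes hi[0] <= lo[2] and hi[1] <= lo[3] (every box is valid) and area(gt) > 0.
    """
    inner = (hi[0], hi[1], lo[2], lo[3])   # smallest box allowed by the limits
    outer = (lo[0], lo[1], hi[2], hi[3])   # largest box allowed by the limits
    inter_min, inter_max = intersection(inner, gt), intersection(outer, gt)
    union_max = area(outer) + area(gt) - inter_min
    union_min = max(area(gt), area(inner) + area(gt) - inter_max)
    return inter_min / union_max, min(1.0, inter_max / union_min)

gt = (0.0, 0.0, 2.0, 2.0)
lo = (0.9, 0.9, 2.9, 2.9)
hi = (1.1, 1.1, 3.1, 3.1)
lower, upper = iou_bounds(lo, hi, gt)
print(lower, iou((1.0, 1.0, 3.0, 3.0), gt), upper)
```

The looseness comes from bounding intersection and union independently, which is the kind of precision loss a purpose-built method aims to avoid.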

26. 【2603.03030】BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology

Link: https://arxiv.org/abs/2603.03030

Authors: Xiaojing Guo,Jiatai Lin,Yumian Jia,Jingqi Huang,Zeyan Xu,Weidong Li,Longfei Wang,Jingjing Chen,Qin Li,Weiwei Wang,Lifang Cui,Wen Yue,Zhiqiang Cheng,Xiaolong Wei,Jianzhong Yu,Xia Jin,Baizhou Li,Honghong Shen,Jing Li,Chunlan Li,Yanfen Cui,Yi Dai,Yiling Yang,Xiaolong Qian,Liu Yang,Yang Yang,Guangshen Gao,Yaqing Li,Lili Zhai,Chenying Liu,Tianhua Zhang,Zhenwei Shi,Cheng Lu,Xingchen Zhou,Jing Xu,Miaoqing Zhao,Fang Mei,Jiaojiao Zhou,Ning Mao,Fangfang Liu,Chu Han,Zaiyi Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable predictive, remarkable predictive capabilities, large-scale multi-organ datasets, pathology foundation models, diverse clinical applications

Comments:

Abstract:Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction. Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT's clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.

27. 【2603.03026】Any Resolution Any Geometry: From Multi-View To Multi-Patch

Link: https://arxiv.org/abs/2603.03026

Authors: Wenqing Cui,Zhenyu Li,Mykola Lavreniuk,Jian Shi,Ramzi Idoughi,Xiangjun Tang,Peter Wonka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prediction remains difficult, remains difficult due, preserving fine local, fine local detail, high-resolution prediction remains

Comments: Project webpage: [this https URL](https://github.com/Dreamaker-MrC/Any-Resolution-Any-Geometry)

Abstract:Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
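
The three metrics quoted in the abstract have standard textbook definitions, sketched below on flat lists of values. Benchmark-specific details such as valid-pixel masks, depth caps, or scale alignment are not given in the abstract and are left out, so treat this as the generic form rather than the paper's evaluation code.

```python
import math

def abs_rel(pred, gt):
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean squared depth error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and reference normals."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(cos))
    return total / len(gt_normals)

# Toy values, chosen only to make the arithmetic easy to follow.
depth_pred, depth_gt = [1.0, 5.0], [2.0, 4.0]
normals_pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
normals_gt = [(0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]

print(abs_rel(depth_pred, depth_gt))
print(rmse(depth_pred, depth_gt))
print(mean_angular_error_deg(normals_pred, normals_gt))
```

For scale, the reported AbsRel drop from 0.0582 to 0.0291 is a 50% relative reduction.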

28. 【2603.02986】VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats

Link: https://arxiv.org/abs/2603.02986

Authors: Alessio Mazzucchelli,Ivan Ojeda-Martin,Fernando Rivas-Manzaneque,Elena Garces,Adrian Penate-Sanchez,Francesc Moreno-Noguer

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Gaussian Splatting, accurately model complex, unprecedented rendering performance, model complex, rendering performance

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026 Feb 24

Abstract:3D Gaussian Splatting (3DGS) has recently transformed the fields of novel view synthesis and 3D reconstruction due to its ability to accurately model complex 3D scenes and its unprecedented rendering performance. However, a significant challenge persists: the absence of an efficient and photorealistic method for editing the appearance of the scene's content. In this paper we introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS while preserving view-dependent effects such as specular highlights. Key to our method are a novel architecture that separates color into diffuse and view-dependent components, and a multi-view training strategy that integrates image patches from multiple viewpoints. Improving over the conventional single-view batch training, our 3DGS representation provides more accurate reconstruction and serves as a solid representation for the recoloring task. For 3DGS recoloring, we then introduce a rapid scheme requiring only one manually edited image of the scene from the end-user. By fine-tuning the weights of a single MLP, alongside a module for single-shot segmentation of the editable area, the color edits are seamlessly propagated to the entire scene in just two seconds, facilitating real-time interaction and providing control over the strength of the view-dependent effects. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors based on Neural Radiance Fields representations.

29. 【2603.02985】The Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes

Link: https://arxiv.org/abs/2603.02985

Authors: Reuben Docea,Rayan Younis,Yonghao Long,Maxime Fleury,Jinjing Xu,Chenyang Li,André Schulze,Ann Wierick,Johannes Bender,Micha Pfeiffer,Qi Dou,Martin Wagner,Stefanie Speidel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: realistic surgical conditions, paired endoscopic video, deforming abdominal soft, abdominal soft tissue, high-quality structured-light geometry

Comments: 16 pages, 10 figures, accompanying data descriptor for dataset, submitted to Scientific Data

Abstract:The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types - whole deformations, incremental deformations, and moved-camera clips - probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.

30. 【2603.02974】Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

Link: https://arxiv.org/abs/2603.02974

Authors: Ertunc Erdil,Nico Schulthess,Guney Tombak,Ender Konukoglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: DINO models provide, provide rich patch-level, recently enabled strong, rich patch-level representations, models provide rich

Comments:

Abstract:DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: this https URL.
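
One common way to make a convolution autoregressive over a 2D grid is to mask its kernel so that each position sees only positions earlier in raster-scan order, as in PixelCNN-style models. The abstract does not say how the AR CNN is built, so the sketch below is an assumption about the general technique rather than a description of the paper's network. It assumes an odd kernel size.

```python
def causal_mask(kernel_size, include_center=False):
    """Raster-scan causal mask: 1 where the kernel may look, 0 elsewhere.

    Rows above the centre are fully visible; in the centre row only the
    columns left of the centre are visible. Assumes an odd kernel_size.
    """
    centre = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask_row = []
        for col in range(kernel_size):
            earlier = row < centre or (row == centre and col < centre)
            at_centre = row == centre and col == centre
            mask_row.append(1 if earlier or (include_center and at_centre) else 0)
        mask.append(mask_row)
    return mask

for line in causal_mask(3):
    print(line)
```

Multiplying the kernel weights by this mask means each patch embedding is predicted from its already-visited neighbours only, so a poorly predicted patch can be flagged as anomalous in one forward pass.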

31. 【2603.02972】TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Link: https://arxiv.org/abs/2603.02972

Authors: Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: disembodied vision-language tasks, Large Vision-Language Models, inherent architectural mismatch, Large Vision-Language, challenge for Large

Abstract: Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon acceptance. Project page: this https URL

32. 【2603.02964】Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention

Link: https://arxiv.org/abs/2603.02964

Authors: Wensheng Wu, Zheming Lu, Ziqian Lu, Zewei He, Xuecheng Sun, Zhao Wang, Jungong Han, Yunlong Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces significant challenges, significant challenges due, Industrial anomaly detection, detection faces significant, Industrial anomaly

Abstract: Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.

33. 【2603.02959】Semi-Supervised Few-Shot Adaptation of Vision-Language Models

Link: https://arxiv.org/abs/2603.02959

Authors: Julio Silva-Rodríguez, Ender Konukoglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: providing rich multi-modal, rich multi-modal embeddings, heterogeneous data sources, pre-trained on large, Vision-language models

Comments: Code: [this https URL](https://github.com/jusiro/SS-Text-U)

Abstract:Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by 50% in low-shot regimes.

34. 【2603.02957】Leveraging Label Proportion Prior for Class-Imbalanced Semi-Supervised Learning

Link: https://arxiv.org/abs/2603.02957

Authors: Kohki Akiba, Shinnosuke Matsuo, Shota Harada, Ryoma Bise

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: pseudo-labeling amplifies majority, Semi-supervised learning, Proportion Loss, pseudo-labeling amplifies, amplifies majority bias

Abstract:Semi-supervised learning (SSL) often suffers under class imbalance, where pseudo-labeling amplifies majority bias and suppresses minority performance. We address this issue with a lightweight framework that, to our knowledge, is the first to introduce Proportion Loss from learning from label proportions (LLP) into SSL as a regularization term. Proportion Loss aligns model predictions with the global class distribution, mitigating bias across both majority and minority classes. To further stabilize training, we formulate a stochastic variant that accounts for fluctuations in mini-batch composition. Experiments on the Long-tailed CIFAR-10 benchmark show that integrating Proportion Loss into FixMatch and ReMixMatch consistently improves performance over the baselines across imbalance severities and label ratios, and achieves competitive or superior results compared to existing CISSL methods, particularly under scarce-label conditions.
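To make the idea of aligning predictions with the global class distribution concrete, here is a minimal sketch of a proportion-style regularizer: the cross-entropy between a known class prior and the batch-mean predicted distribution, which is a common form in the learning-from-label-proportions literature. The exact loss and the stochastic variant used in the paper are not given in the abstract, so the function `proportion_loss` and the toy numbers are assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def proportion_loss(logits, class_prior, eps=1e-8):
    """Cross-entropy between a class prior and the batch-mean prediction.

    logits: (batch, classes) array of unnormalized scores.
    class_prior: (classes,) array summing to 1 (assumed known).
    """
    mean_pred = softmax(logits).mean(axis=0)
    return float(-np.sum(class_prior * np.log(mean_pred + eps)))

# Hypothetical long-tailed prior over three classes.
prior = np.array([0.8, 0.15, 0.05])

# Batch whose mean prediction matches the prior exactly.
matched_logits = np.log(np.tile(prior, (4, 1)))
# Batch collapsed onto the majority class.
collapsed_logits = np.tile(np.array([5.0, 0.0, 0.0]), (4, 1))

loss_matched = proportion_loss(matched_logits, prior)
loss_collapsed = proportion_loss(collapsed_logits, prior)
print(loss_matched, loss_collapsed)
```

In this toy case the collapsed batch is penalized more than the matched one, which is the behaviour a regularizer of this kind relies on to counter majority bias in pseudo-labels.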

35. 【2603.02951】CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

Link: https://arxiv.org/abs/2603.02951

Authors: Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, Graphical User, User Interface, achieved significant development, multimodal large language

Abstract: Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
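The gradient surgery step can be illustrated with a PCGrad-style projection: when the SFT gradient has a negative inner product with the anchor gradient, the conflicting component along the anchor is removed. This is a sketch of that general technique on flat vectors; the paper's exact rule is not given in the abstract, and `project_out_conflict` is a hypothetical name.

```python
import numpy as np

def project_out_conflict(g_sft, g_anchor, eps=1e-12):
    """Remove from g_sft the component that opposes g_anchor.

    If the two gradients agree (non-negative dot product), g_sft is
    returned unchanged. Otherwise its projection onto g_anchor is
    subtracted, so the result is orthogonal to g_anchor.
    """
    dot = float(np.dot(g_sft, g_anchor))
    if dot >= 0.0:
        return g_sft.copy()
    scale = dot / (float(np.dot(g_anchor, g_anchor)) + eps)
    return g_sft - scale * g_anchor

g_anchor = np.array([1.0, 0.0])      # stands in for the GRPO gradient
g_conflict = np.array([-1.0, 1.0])   # SFT gradient opposing the anchor
g_aligned = np.array([0.5, 1.0])     # SFT gradient agreeing with the anchor

fixed = project_out_conflict(g_conflict, g_anchor)
kept = project_out_conflict(g_aligned, g_anchor)
print(fixed, kept)
```

After projection the conflicting gradient no longer pushes against the anchor direction, while the aligned gradient passes through untouched.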

36. 【2603.02943】TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

Link: https://arxiv.org/abs/2603.02943

Authors: Benlei Cui, Shaoxuan He, Bukun Huang, Zhizeng Ye, Yunyun Sun, Longtao Huang, Hui Xue, Yang Yang, Jingqun Tang, Zhou Zhao, Haiwen Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: substantial computational burden, iterative sampling process, diffusion models, models are hindered, substantial computational

Comments: CVPR 2026

Abstract:Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
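A scalar toy example shows why a rational (Padé) form can extrapolate better than a truncated Taylor polynomial over a large step. The sketch builds the [1/1] Padé approximant from the same three Taylor coefficients used by a second-order Taylor expansion and compares both on exp(-x). It illustrates the underlying approximation only, not the TC-Padé feature predictor; `pade11` and `taylor2` are hypothetical helper names.

```python
import math

def taylor2(c0, c1, c2, x):
    """Second-order Taylor polynomial c0 + c1*x + c2*x^2."""
    return c0 + c1 * x + c2 * x * x

def pade11(c0, c1, c2, x):
    """[1/1] Pade approximant (a0 + a1*x) / (1 + b1*x).

    Coefficients are chosen so its series matches the Taylor
    coefficients c0, c1, c2 up to x^2. Requires c1 != 0.
    """
    b1 = -c2 / c1
    a0 = c0
    a1 = c1 + b1 * c0
    return (a0 + a1 * x) / (1.0 + b1 * x)

# Taylor coefficients of exp(-x) around x = 0.
c0, c1, c2 = 1.0, -1.0, 0.5

x_far = 3.0  # a large step away from the expansion point
exact = math.exp(-x_far)
err_taylor = abs(taylor2(c0, c1, c2, x_far) - exact)
err_pade = abs(pade11(c0, c1, c2, x_far) - exact)
print(err_taylor, err_pade)
```

Both approximants agree near the expansion point, but far from it the polynomial grows without bound while the rational form stays bounded, which mirrors the abstract's point about larger intervals between steps.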

37. 【2603.02929】TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Link: https://arxiv.org/abs/2603.02929

Authors: Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, JinQiao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Universal Multimodal Retrieval, Multimodal Large Language, interpreting diverse user, Universal Multimodal, Large Language Models

Abstract:Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

38. 【2603.02926】GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights

Link: https://arxiv.org/abs/2603.02926

Authors: Qiming He, Jing Li, Tian Guan, Yifei Ma, Zimo Zhao, Yanxia Wang, Hongjing Chen, Yingming Xu, Shuang Ge, Yexing Zhang, Yizhi Wang, Xinrui Chen, Lianghui Zhu, Yiqing Liu, Qingxia Hou, Shuyan Zhao, Xiaoqin Wang, Lili Ma, Peizhen Hu, Qiang Huang, Zihan Wang, Zhiyuan Shen, Junru Cheng, Siqi Zeng, Jiurun Chen, Zhen Song, Chao He, Zhe Wang, Yonghong He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: patterns remain challenging, fine-grained lesion patterns, lesion patterns remain, current AI approaches, diagnosis and prognosis

Abstract: Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis, outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.

39. 【2603.02924】HDINO: A Concise and Efficient Open-Vocabulary Detector

Link: https://arxiv.org/abs/2603.02924

Authors: Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing methods rely, methods rely heavily, manually curated fine-grained, resource-intensive layer-wise cross-modal, curated fine-grained training

Abstract: Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2 (trained on 5.4M and 6.5M images, respectively) by 0.8 mAP and 2.8 mAP. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at this https URL.

40. 【2603.02919】Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Link: https://arxiv.org/abs/2603.02919

Authors: Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Video Diffusion Transformers, Diffusion Transformers, synthesizing high-quality video, Video Diffusion, text descriptions involving

Comments: CVPR 2026

Abstract:Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

41. 【2603.02910】Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement

Link: https://arxiv.org/abs/2603.02910

Authors: Hao Ai, Wenjie Chang, Jianbo Jiao, Ales Leonardis, Ofek Eyal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Articulated objects, daily life, ubiquitous in daily, Articulated, parts

Comments: Accepted by ICLR 2026. Project Page: [this https URL](https://haoai-1997.github.io/AiM/)

Abstract:Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects cannot be clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: this https URL.

42. 【2603.02907】Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework

Link: https://arxiv.org/abs/2603.02907

Authors: Chenran Lin, Lok Ming Lui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Harmonic Beltrami Signature, Beltrami Signature Network, Harmonic Beltrami, Beltrami Signature, presents the Harmonic

Abstract:This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.

43. 【2603.02897】ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

Link: https://arxiv.org/abs/2603.02897

Authors: Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: delivered remarkable improvements, generative image compression, Recent advances, Progressive Generative Image, generative image

Abstract: Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility and delivers over 10 times faster encoding and decoding than MS-ILLM on GPUs.
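The stage-by-stage residual encoding described in the abstract can be sketched in a few lines. The example below uses random codebooks (not learned ones) and a single vector, so it only demonstrates the mechanics: each stage quantizes what the previous stages left over, and decoding a prefix of the indices yields a coarser preview. The functions `rvq_encode` and `rvq_decode`, the codebook sizes, and the zero codeword are assumptions made for illustration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization of one vector.

    At every stage, pick the codeword closest to the current residual,
    record its index, and subtract it from the residual.
    """
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks, num_stages=None):
    """Sum the chosen codewords of the first num_stages stages."""
    if num_stages is None:
        num_stages = len(indices)
    out = np.zeros(codebooks[0].shape[1])
    for idx, cb in zip(indices[:num_stages], codebooks[:num_stages]):
        out = out + cb[idx]
    return out

rng = np.random.default_rng(0)
dim = 4
codebooks = []
for stage in range(3):
    # Later stages use smaller codewords to refine finer detail.
    cb = rng.normal(scale=1.0 / (2 ** stage), size=(8, dim))
    cb[0] = 0.0  # zero codeword: a stage can never increase the error
    codebooks.append(cb)

x = rng.normal(size=dim)
indices = rvq_encode(x, codebooks)
# Reconstruction error when decoding 0, 1, 2, or 3 stages.
errors = [float(np.linalg.norm(x - rvq_decode(indices, codebooks, k)))
          for k in range(len(codebooks) + 1)]
print(indices, errors)
```

Because the index list is ordered by stage, transmitting it in order gives a progressive bitstream: a receiver holding only the first few indices can already decode a coarse reconstruction.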

44. 【2603.02896】3D-DRES: Detailed 3D Referring Expression Segmentation

Link: https://arxiv.org/abs/2603.02896

Authors: Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rich compositional contextual, compositional contextual reasonings, visual grounding tasks, natural language expressions, Referring Expression Segmentation

Comments: AAAI 2026

Abstract: Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.

45. 【2603.02893】Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

Link: https://arxiv.org/abs/2603.02893

Authors: Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, Ronggang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Score Distillation Sampling, Distillation Sampling, challenging problem, exclusively studied, Score Distillation

Abstract: 3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

46. 【2603.02888】LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Link: https://arxiv.org/abs/2603.02888

Authors: Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: domain-specific knowledge integration, video data demand, data demand retrieval, demand retrieval systems, retrieval systems capable

Comments: Accepted by AAAI 2026 Workshop on New Frontiers in Information Retrieval

Abstract:The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.

47. 【2603.02887】Generalized non-exponential Gaussian splatting

Link: https://arxiv.org/abs/2603.02887

Authors: Sébastien Speierer, Adrian Jarabo

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian splatting, physically-based alpha-blending operators, wider family, family of physically-based, alpha-blending operators

Comments: 13 pages, 6 figures, 4 tables

Abstract: In this work we generalize 3D Gaussian splatting (3DGS) to a wider family of physically-based alpha-blending operators. 3DGS has become the de facto standard for radiance field rendering and reconstruction, given its flexibility and efficiency. At its core, it is based on alpha-blending sorted semitransparent primitives, which in the limit converges to the classic radiative transfer function with exponential transmittance. Inspired by recent research on non-exponential radiative transfer, we generalize the image formation model of 3DGS to non-exponential regimes. Based on this generalization, we use a quadratic transmittance to define sub-linear, linear, and super-linear versions of 3DGS, which exhibit faster-than-exponential decay. We demonstrate that these new non-exponential variants achieve quality similar to the original 3DGS while significantly reducing the number of overdraws, which results in speed-ups of up to 4x in complex real-world captures on a ray-tracing-based renderer.
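For reference, the standard alpha-blending of sorted primitives that the abstract starts from can be written as a short front-to-back loop. The sketch below shows only this conventional operator; the paper's non-exponential variants are not specified in the abstract and are not reproduced here. The function name `composite_front_to_back` is hypothetical.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Conventional alpha-blending of depth-sorted primitives.

    colors: (n, 3) array, nearest primitive first.
    alphas: (n,) array of opacities in [0, 1].
    Returns the blended colour and the remaining transmittance.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    transmittance = 1.0
    out = np.zeros(colors.shape[1])
    for c, a in zip(colors, alphas):
        out += transmittance * a * c   # light contributed by this primitive
        transmittance *= (1.0 - a)     # light still passing through
    return out, transmittance

red_green = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pixel, t_left = composite_front_to_back(red_green, [0.5, 0.5])
opaque_pixel, t_opaque = composite_front_to_back(red_green, [1.0, 0.5])
print(pixel, t_left)
```

Once the remaining transmittance reaches zero, later primitives contribute nothing, so a renderer can stop early; a transmittance that decays faster along the ray would plausibly reach that point after fewer primitives, which fits the reduction in overdraw the abstract reports.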

48. 【2603.02886】StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting

Link: https://arxiv.org/abs/2603.02886

Authors: Guoqing Ma, Xun Lin, Hui Ma, Ajian Liu, Yizhong Liu, Wenzhong Tang, Shan Yu, Chenqi Kong, Yi Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: models assume access, Face Forgery Detection, Forgery Detection, assume access, FFD

Comments: Accepted by Machine Intelligence Research

Abstract: Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model's ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers' suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.

49. 【2603.02883】SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Link: https://arxiv.org/abs/2603.02883

Authors: Wonsuk Jang, Thierry Tambe

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, hinder edge deployment, achieve strong video, strong video generation, compute costs hinder

Abstract:Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization (selecting a per-block optimal format, a dialect, from multiple candidates, a formatbook) by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
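
The per-block format selection described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three candidate grids in `FORMATBOOK`, the function names, and the brute-force squared-error search are all assumptions made for illustration, whereas the abstract describes lookup tables that keep the online cost low.

```python
# Hypothetical sketch of per-block format ("dialect") selection.
# The grids below are made up for illustration; they are not the paper's formats.
FORMATBOOK = {
    "uniform": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "log_like": [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0],
    "dense_small": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0],
}


def quantize_block(block, levels):
    """Scale the block to the format's range, then snap each magnitude to the nearest level."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / max(levels)
    out = []
    for v in block:
        nearest = min(levels, key=lambda q: abs(abs(v) / scale - q))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def select_dialect(block, formatbook=FORMATBOOK):
    """Return (format name, quantized block) for the candidate with the lowest squared error."""
    best = None
    for name, levels in formatbook.items():
        quantized = quantize_block(block, levels)
        err = sum((a - b) ** 2 for a, b in zip(block, quantized))
        if best is None or err < best[0]:
            best = (err, name, quantized)
    return best[1], best[2]


def quantize_blockwise(values, block_size=4):
    """Split a flat list into blocks and pick one dialect per block."""
    return [
        select_dialect(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]
```

In this toy setup, a block whose values happen to sit on one grid is assigned that grid, so different blocks of the same tensor can end up with different formats. An exhaustive search like this one grows with the number of candidates, which is presumably the cost the lookup tables are meant to avoid.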

50. 【2603.02882】SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

Link: https://arxiv.org/abs/2603.02882

Authors: Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma, Yunhao Wang, Xiaobing Guo, Weifeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, Intelligence Generated, Generated Content

Comments: Accepted to ICLR 2026

Abstract:Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at this https URL.

51. 【2603.02872】Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Link: https://arxiv.org/abs/2603.02872

Authors: Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Language Models, Vision Language

Abstract:Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at this https URL.

52. 【2603.02866】Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis

Link: https://arxiv.org/abs/2603.02866

Authors: Kaiqiang Xiong, Zhanke Wang, Ronggang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, mechanism for hierarchical, view synthesis, central mechanism, inject fine Gaussians

Abstract:We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues (photometric rendering residuals, semantic priors, and geometric priors) to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.

53. 【2603.02865】Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.02865

Authors: Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, demonstrate strong performance, Large vision-language, diagram understanding benchmarks, arrows and lines

54. 【2603.02843】Scale-invariant Gaussian derivative residual networks

Link: https://arxiv.org/abs/2603.02843

Authors: Andrzej Perzanowski, Tony Lindeberg

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Gaussian derivative residual, image scales remains, Gaussian derivative, scale-invariant Gaussian derivative, scale-covariant Gaussian derivative

Comments: 39 pages, 23 figures, 5 tables

Abstract:Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation.

55. 【2603.02829】Toward Early Quality Assessment of Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2603.02829

Authors: Huanlei Guo, Hongxin Wei, Bingyi Jing

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: natural language prompts, produce highly realistic, language prompts, natural language, Recent

Abstract:Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at this https URL.
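
To see how probing at 20% of the trajectory could translate into the reported savings, the following Python sketch works through the cost arithmetic of probe-then-continue sampling. It is only an illustration: the function names and the 25% keep fraction are assumptions that do not come from the abstract, and the cost of the quality predictor itself is ignored.

```python
def relative_sampling_cost(probe_fraction, keep_fraction):
    """Cost of probe-then-continue relative to fully denoising every seed.

    Every seed runs the first `probe_fraction` of the trajectory; only
    `keep_fraction` of the seeds run the remaining steps.
    """
    if not (0.0 <= probe_fraction <= 1.0 and 0.0 <= keep_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    return probe_fraction + keep_fraction * (1.0 - probe_fraction)


def select_seeds(early_scores, keep_fraction):
    """Return the indices of the top-scoring seeds (at least one), in ascending order."""
    n_keep = max(1, round(len(early_scores) * keep_fraction))
    ranked = sorted(range(len(early_scores)), key=lambda i: early_scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

With a probe at 20% of the trajectory and an assumed keep fraction of 25%, the relative cost is 0.2 + 0.25 × 0.8 = 0.4, a 60% reduction. Under this simple model, savings above 60% require continuing fewer than a quarter of the seeds.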

56. 【2603.02816】BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Link: https://arxiv.org/abs/2603.02816

Authors: Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: revolutionized content creation, remains largely untapped, commercial potential remains, potential remains largely, content creation

Abstract:The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.

57. 【2603.02805】ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Link: https://arxiv.org/abs/2603.02805

Authors: Douglass Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coordinate stream captured, touch input, lacks a unified, coordinate stream, stream captured

Abstract:Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
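
The idea of decomposing pen movement into unit pixel steps can be sketched in a few lines of Python. This is a guess at the general scheme, not the paper's tokenizer: it assumes the eight non-pen-state tokens are the eight compass directions, uses screen coordinates (y grows downward), walks diagonals first, and encodes pen-up travel between strokes with the same step tokens. All token names are hypothetical.

```python
# Hypothetical vocabulary: 8 unit-step tokens plus 2 pen-state tokens (10 in total).
STEPS = {
    (1, 0): "E", (1, 1): "SE", (0, 1): "S", (-1, 1): "SW",
    (-1, 0): "W", (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
}
DELTAS = {name: delta for delta, name in STEPS.items()}
PEN_DOWN, PEN_UP = "DOWN", "UP"


def _sign(v):
    return (v > 0) - (v < 0)


def _walk(start, end):
    """Yield unit-step tokens that move from start to end (diagonal steps first)."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    while dx or dy:
        sx, sy = _sign(dx), _sign(dy)
        yield STEPS[(sx, sy)]
        dx -= sx
        dy -= sy


def encode(strokes):
    """Encode non-empty strokes (lists of integer (x, y) points) as a token list.

    Positions are relative to the first point of the first stroke.
    """
    tokens = []
    cursor = strokes[0][0] if strokes else (0, 0)
    for stroke in strokes:
        tokens.extend(_walk(cursor, stroke[0]))  # pen-up travel to the stroke start
        tokens.append(PEN_DOWN)
        cursor = stroke[0]
        for point in stroke[1:]:
            tokens.extend(_walk(cursor, point))
            cursor = point
        tokens.append(PEN_UP)
    return tokens


def decode(tokens, origin=(0, 0)):
    """Rebuild pixel-level strokes from tokens, starting at origin."""
    strokes, current, pos = [], None, origin
    for tok in tokens:
        if tok == PEN_DOWN:
            current = [pos]
        elif tok == PEN_UP:
            strokes.append(current)
            current = None
        else:
            dx, dy = DELTAS[tok]
            pos = (pos[0] + dx, pos[1] + dy)
            if current is not None:
                current.append(pos)
    return strokes
```

Because every movement reduces to the same ten symbols, any integer-coordinate ink can be represented without out-of-vocabulary tokens. The resulting sequences contain long runs of repeated steps, which is the kind of redundancy BPE merges can compress.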

58. 【2603.02803】Structure-Aware Text Recognition for Ancient Greek Critical Editions

Link: https://arxiv.org/abs/2603.02803

Authors: Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, visual language models, advances in visual, visual language, Ancient Greek critical

Abstract:Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.

59. 【2603.02802】NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Link: https://arxiv.org/abs/2603.02802

Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive results, Recent video editing, large-scale paired datasets, require large-scale paired, Recent video

Comments: Accepted by CVPR 2026

Abstract:Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.

60. 【2603.02801】R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild

Link: https://arxiv.org/abs/2603.02801

Authors: Margherita Lea Corona, Wieland Morgenstern, Peter Eisert, Anna Hilsmann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving outstanding rendering, Gaussian Splatting, achieving outstanding, fast training, leading technique

Comments: Accepted at VISAPP 2026

Abstract:3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary.

61. 【2603.02795】VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

Link: https://arxiv.org/abs/2603.02795

Authors: Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal, search, increasingly becoming autonomous, Large models, multimodal search

Comments: 23 pages, 6 figures

Abstract:Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which uses reinforcement learning to turn a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. In addition, we propose MM-SearchExam, a multimodal search benchmark dedicated to evaluating the search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks demonstrate the effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.

62. 【2603.02790】Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

Link: https://arxiv.org/abs/2603.02790

Authors: Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D'Amato, Clément Grisi, Luc Builtjes, Joeran S. Bosma, Judith Lefkes, Rianne A. Weber, James A. Meakin, Thomas Koopman, Anne Mickan, Mathias Prokop, Ewoud J. Smit, Geert Litjens, Jeroen van der Laak, Bram van Ginneken, Maarten de Rooij, Henkjan Huisman, Colin Jacobs, Francesco Ciompi, Alessa Hering (and on behalf of the UNICORN consortium)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learn broadly generalizable, broadly generalizable features, models show promise, Medical foundation models, features from large

Comments: This paper describes the dataset and design of the UNICORN challenge and provides the link to Grand Challenge

Abstract:Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via this http URL.

63. 【2603.02785】HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning

Link: https://arxiv.org/abs/2603.02785

Authors: Zihao Peng, Nan Zou, Jiandian Zeng, Guo Li, Ke Chen, Boyuan Li, Tian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision tasks due, Vision Transformers, vision tasks, strong transferability, widely adopted

Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA's design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
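
One plausible reading of the three-tier design is that each client's effective weight is the frozen backbone weight plus one low-rank update per tier. The Python sketch below illustrates that reading only. The additive composition, the function names, and the toy matrices are assumptions; the abstract does not spell out how the root, cluster, and leaf adapters are combined, and the orthogonality constraint and clustering step are omitted here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def effective_weight(w0, adapters):
    """Add each tier's low-rank update B @ A to the frozen weight w0.

    `adapters` is a list of (B, A) pairs, for example [root, cluster, leaf].
    """
    w = [row[:] for row in w0]
    for b, a in adapters:
        w = add(w, matmul(b, a))
    return w
```

In this reading, the root pair would be shared by all clients, the cluster pair by clients in the same inferred group, and the leaf pair kept local, so only the small `B` and `A` factors need to be communicated.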

64. 【2603.02767】ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

Link: https://arxiv.org/abs/2603.02767

Authors: HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain partially organized, visual representation learning, visual representation, yield representations, dominant paradigm

Abstract:Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.

65. 【2603.02754】Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

Link: https://arxiv.org/abs/2603.02754

Authors: Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, sensing visual question-answering, large language models, remote sensing visual, visual grounding failures

Abstract:Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: this https URL

66. 【2603.02748】iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

Link: https://arxiv.org/abs/2603.02748

Authors: HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: instruction-agnostic vision encoders, Large Vision, success of Large, instruction-agnostic vision, existing architectures suffer

Abstract:Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
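
Affine feature modulation via AdaLN can be sketched generically in Python as follows. This shows the common AdaLN form, `(1 + scale) * LN(x) + shift`, and is not iGVLM's actual layer: the `(1 + scale)` convention, the function names, and the idea that `scale` and `shift` arrive as ready-made per-channel vectors derived from the instruction are all assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(visual_feature, scale, shift):
    """Apply conditioned affine modulation: (1 + scale) * LN(x) + shift, per channel."""
    normed = layer_norm(visual_feature)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]
```

In this sketch, zero `scale` and `shift` leave the normalized visual feature unchanged, so the instruction only alters the representation to the extent that the conditioning vectors move away from zero.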

67. 【2603.02743】CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

Link: https://arxiv.org/abs/2603.02743

Authors: Waqas Ahmed,Dean Diepeveen,Ferdous Sohel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Realistic shadow generation, achieving seamless image, existing methods primarily, methods primarily focus, Realistic shadow

Comments:

Click to view abstract

Abstract:Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.

68. 【2603.02727】Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

Link: https://arxiv.org/abs/2603.02727

Authors: Hongbo Zheng,Afshin Bozorgpour,Dorit Merhof,Minjia Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserve fine anatomical, fine anatomical boundaries, image segmentation requires, segmentation requires models, requires models

Comments:

Click to view abstract

Abstract:Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
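
The core idea described here, two kernelized attention paths subtracted with a channel-wise scale and then gated, can be sketched in plain Python. This is an illustrative sketch only: the `elu(x) + 1` feature map, the per-token sigmoid gate, and all function and parameter names are assumptions, and how the paper splits queries and keys into complementary subspaces is not reproduced.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, a common choice for kernelized attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """O(N) attention: accumulate key-value statistics once, then reuse them per query."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]   # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                      # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        out.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out

def gated_differential_attention(q1, k1, q2, k2, values, lam, gate_logits):
    """Subtract a second attention path (scaled per channel by `lam`), then gate each token."""
    a1 = linear_attention(q1, k1, values)
    a2 = linear_attention(q2, k2, values)
    out = []
    for row1, row2, g in zip(a1, a2, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))    # hypothetical sigmoid gate
        out.append([gate * (x - l * y) for x, y, l in zip(row1, row2, lam)])
    return out

# Hypothetical toy data: two tokens, 2-dim queries/keys, one-hot values.
q = [[0.2, -0.1], [1.0, 0.3]]
k = [[0.5, 0.1], [-0.3, 0.8]]
v = [[1.0, 0.0], [0.0, 1.0]]
base = linear_attention(q, k, v)
# Identical paths with lam = 1 cancel completely: the common-mode signal is removed.
cancelled = gated_differential_attention(q, k, q, k, v, [1.0, 1.0], [0.0, 0.0])
```

Because the key-value statistics are accumulated in a single pass over the tokens, the cost grows linearly with sequence length, matching the $\mathcal{O}(N)$ scaling the abstract refers to.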

69. 【2603.02726】Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement

Link: https://arxiv.org/abs/2603.02726

Authors: Hongying Zhang,ShuaiShuai Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: establish spatial correspondences, CVGL remains challenging, aims to establish, GNSS-denied environments, correspondences between images

Comments:

Click to view abstract

Abstract:Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large-scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains. SFDE adopts a three-branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at this https URL.

70. 【2603.02720】TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework

Link: https://arxiv.org/abs/2603.02720

Authors: Ting-Wei Zhou,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yu-Bang Zheng,Deyu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: receive increasing attention, tensor decomposition structure, decomposition structure search, tensor decomposition, tensor

Comments:

Click to view abstract

Abstract:New tensor decompositions continue to emerge and receive increasing attention. Selecting a suitable tensor decomposition to exactly capture the low-rank structures behind the data is at the heart of the tensor decomposition field, and it remains a challenging and relatively under-explored problem. Current tensor decomposition structure search methods are still confined to a fixed factor-interaction family (e.g., tensor contraction) and cannot deliver a mixture of decompositions. To address this problem, we design a mixture-of-experts-based tensor decomposition structure search framework (termed TenExp), which allows us to dynamically select and activate suitable tensor decompositions in an unsupervised fashion. This framework enjoys two unique advantages over the state-of-the-art tensor decomposition structure search methods. First, TenExp can provide a suitable single decomposition beyond a fixed factor-interaction family. Second, TenExp can deliver a suitable mixture of decompositions beyond a single decomposition. Theoretically, we also provide the approximation error bound of TenExp, which reveals the approximation capability of TenExp. Extensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed TenExp compared to the state-of-the-art tensor decomposition-based methods.

71. 【2603.02712】From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

Link: https://arxiv.org/abs/2603.02712

Authors: Ruxue Yan,Xubo Liu,Wenya Guo,Zhengkun Zhang,Ying Zhang,Xiaojie Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: reinforcement learning, Autoregressive image generation, input prompt, Autoregressive image, Constrained Reasoning

Comments:

Click to view abstract

Abstract:Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify "What" details to depict by rewriting the input prompt, yet fundamentally fail to reason about "How" to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a "How-to-What" paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces "How to draw" by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description "What to draw", providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).

72. 【2603.02710】MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

Link: https://arxiv.org/abs/2603.02710

Authors: Lingshun Kong,Jiawei Zhang,Zhengpeng Duan,Xiaohe Wu,Yueqi Yang,Xiaotao Wang,Dongqing Zou,Lei Lei,Jinshan Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: impose diverse requirements, making it difficult, image restoration, restoration strategies, degradation types

Comments: Project website: [this https URL](https://github.com/kkkls/MIM-DiT)

Click to view abstract

Abstract:All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against state-of-the-art approaches on multiple image restoration tasks.
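
The two-level routing described in this abstract can be pictured with a small gating sketch. It is only an illustration under stated assumptions: the function name `two_level_route`, the softmax gates, and the top-k selection inside each group are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def two_level_route(inter_logits, intra_logits, top_k=1):
    """Return {(group, expert): weight} for a dual-level mixture of experts.

    Outer level: soft combination over expert groups.
    Inner level: keep the top_k sub-experts of each group and renormalize them.
    """
    group_weights = softmax(inter_logits)
    routing = {}
    for g, gw in enumerate(group_weights):
        expert_weights = softmax(intra_logits[g])
        top = sorted(range(len(expert_weights)),
                     key=lambda e: expert_weights[e], reverse=True)[:top_k]
        kept = sum(expert_weights[e] for e in top)
        for e in top:
            routing[(g, e)] = gw * expert_weights[e] / kept
    return routing

# Hypothetical logits: 3 degradation groups with 4 sub-experts each.
inter = [2.0, 0.5, -1.0]
intra = [[0.1, 1.2, -0.3, 0.0], [0.9, 0.2, 0.4, -1.0], [0.0, 0.0, 2.0, 0.5]]
routing = two_level_route(inter, intra, top_k=2)
```

The final weights sum to one, so the restored output can be formed as a weighted sum of the selected sub-experts' outputs.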

73. 【2603.02704】Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model

Link: https://arxiv.org/abs/2603.02704

Authors: Yuhang Liu,Yueyang Cang,Wenge Que,Xinru Bai,Xingtong Wang,Kuisheng Chen,Jingya Li,Xiaoteng Zhang,Xinmin Li,Lixia Zhang,Pingge Hu,Qiaoting Xie,Peiyu Xu,Xianxu Zeng,Li Shi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: gestational trophoblastic disease, threatens maternal health, GTD pathological diagnosis, GTD pathological, trophoblastic disease

Comments: 29 pages, 3 figures

Click to view abstract

Abstract:The pathological diagnosis of gestational trophoblastic disease (GTD) is time-consuming and relies heavily on the experience of pathologists, and the consistency of initial diagnoses is low, which seriously threatens maternal health and reproductive outcomes. We developed an expert model for GTD pathological diagnosis, named GTDoctor. GTDoctor can perform pixel-based lesion segmentation on pathological slides, and output diagnostic conclusions and personalized pathological analysis results. We developed a software system, GTDiagnosis, based on this technology and conducted clinical trials. The retrospective results demonstrated that GTDiagnosis achieved a mean precision of over 0.91 for lesion detection in pathological slides (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained a Positive Predictive Value of 95.59% (n=68 patients). The tool reduced average diagnostic time from 56 to 16 seconds per case (n=285 patients). GTDoctor and GTDiagnosis offer a novel solution for GTD pathological diagnosis, enhancing diagnostic performance and efficiency while maintaining clinical interpretability.

74. 【2603.02697】ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Link: https://arxiv.org/abs/2603.02697

Authors: Jiayi Zhu,Jianing Zhang,Yiying Yang,Wei Cheng,Xiaoyun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: framework enabling multi-agent, paper presents ShareVerse, generation framework enabling, addressing the gap, shared world modeling

Comments:

Click to view abstract

Abstract:This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

75. 【2603.02692】FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

Link: https://arxiv.org/abs/2603.02692

Authors: Aro Kim,Myeongjin Jang,Chaewon Moon,Youngjin Shin,Jinwoo Jeong,Sang-hyo Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently driven remarkable, driven remarkable progress, approaches have recently, recently driven, driven remarkable

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: this https URL.

76. 【2603.02691】ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT

Link: https://arxiv.org/abs/2603.02691

Authors: Yong Eun Choi,Hyoung Suk Park,Kiwan Jeon,Hyun-Cheol Park,Sung Ho Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently shown strong, shown strong potential, deterministic degradation processes, generalized diffusion models, explicitly modeling deterministic

Comments: 10 pages, 4 figures. Submitted to MICCAI 2026

Click to view abstract

Abstract:Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.

77. 【2603.02681】VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Link: https://arxiv.org/abs/2603.02681

Authors: Jinxiang Lai,Zexin Lu,Jiajun He,Rongwei Quan,Wenzhe Zhao,Qinyu Yang,Qi Chen,Qin Lin,Chuyue Li,Tao Gao,Yuhao Shan,Shuai Shao,Song Guo,Qinglin Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous creative planning, creative workflows-capabilities challenging, workflow-based agents lack, agents lack specialized, lack specialized knowledge

Comments:

Click to view abstract

Abstract:Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

78. 【2603.02667】DREAM: Where Visual Understanding Meets Text-to-Image Generation

Link: https://arxiv.org/abs/2603.02667

Authors: Chao Li,Tianhong Li,Sai Vidyaranya Nuthalapati,Hong-You Chen,Satya Narayan Shukla,Yonghuan Yang,Jun Xiao,Xiangjun Fan,Aashu Singh,Dina Katabi,Shlok Kumar Mishra

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Unifying visual representation, Unifying visual, single model remains, remains a central, central challenge

Comments:

Click to view abstract

Abstract:Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
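
A progressive masking schedule of the kind this abstract calls Masking Warmup might look like the sketch below. The linear ramp, the function name `masking_warmup_ratio`, and the default start and end ratios are assumptions for illustration; the abstract only states that masking starts minimal and gradually moves to full masking.

```python
def masking_warmup_ratio(step, warmup_steps, start_ratio=0.1, end_ratio=1.0):
    """Mask ratio at a given training step, ramping linearly from start_ratio to end_ratio.

    The linear shape and the default ratios are hypothetical choices.
    """
    if warmup_steps <= 0:
        return end_ratio
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return start_ratio + (end_ratio - start_ratio) * progress

# Mask ratios at a few checkpoints of a hypothetical 1000-step warmup.
schedule = [masking_warmup_ratio(s, 1000) for s in (0, 250, 500, 1000, 2000)]
```

Early steps keep most image tokens visible, which suits contrastive alignment, while later steps approach full masking for generative training.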

79. 【2603.02663】Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Link: https://arxiv.org/abs/2603.02663

Authors: Shunki Uebayashi,Kento Masui,Kyohei Atarashi,Han Bao,Hisashi Kashima,Naoto Inoue,Mayu Otani,Koh Takeuchi

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, general architectures capable, Language Models

Comments: 24 pages, 20 figures, accepted to ICLR 2026

Click to view abstract

80. 【2603.02658】OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

Link: https://arxiv.org/abs/2603.02658

Authors: Zhengwei Yang,Andi Long,Hao Li,Zechao Hu,Kui Jiang,Zheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incomplete fashion annotations, spans multiple tasks, intelligence spans multiple, spans multiple, remains hindered

Comments: 12 pages, 8 figures

Click to view abstract

Abstract:Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtask and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.

81. 【2603.02648】SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation

Link: https://arxiv.org/abs/2603.02648

Authors: Fengming Zhang,Tao Yan,Jianchao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: presents significant challenges, Transparent object instance, object instance segmentation, including boundary blur, segmentation presents significant

Comments: 5 pages, 4 figures, accepted to ISCAS 2026

Click to view abstract

Abstract:Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.

82. 【2603.02629】Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

Link: https://arxiv.org/abs/2603.02629

Authors: Kaifang Long,Lianbo Ma,Jiaqi Liu,Liming Liu,Guoyang Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anomaly detection seeks, systematically detect anomalies, support incremental learning, accommodate emerging objects, multimodal anomaly detection

Comments:

Click to view abstract

Abstract:The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former is dedicated to disentangling inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.

83. 【2603.02619】Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

Link: https://arxiv.org/abs/2603.02619

Authors: Seunguk Do,Minwoo Huh,Joonghyuk Shin,Jaesik Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, Direct Reward fine-tuning, human, reconstruction has achieved, achieved remarkable

Comments: ICLR 2026, Project webpage: [this https URL](https://seunguk-do.github.io/drpose)

Click to view abstract

Abstract:Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: this https URL.

84. 【2603.02618】Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Link: https://arxiv.org/abs/2603.02618

Authors: Zhikang Xu,Qianqian Xu,Zitai Wang,Cong Hua,Sicong Li,Zhiyong Yang,Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deploying machine learning, machine learning models, unknown classes, open-world scenarios, OOD detection

Comments: Accepted by the main track of CVPR 2026

Click to view abstract

Abstract:Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
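
For readers unfamiliar with the FPR95 metric cited in this abstract, it is the false positive rate on OOD samples at the threshold where 95% of in-distribution (ID) samples are accepted. A minimal sketch follows, assuming that a higher score means "more ID-like"; the function name `fpr_at_tpr` and the toy scores are illustrative and are not from the paper.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that accepts
    at least `tpr` of the ID samples (higher score = more ID-like)."""
    ranked = sorted(id_scores, reverse=True)
    n_keep = math.ceil(tpr * len(ranked))   # number of ID samples that must be accepted
    threshold = ranked[n_keep - 1]
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)

# Hypothetical detector scores.
id_scores = [0.9, 0.8, 0.7, 0.6]
separated = fpr_at_tpr(id_scores, [0.1, 0.2])
overlapping = fpr_at_tpr(id_scores, [0.95, 0.65, 0.1, 0.2])
```

Lower is better: a well-separated detector lets few OOD samples pass the ID threshold, which is why the abstract reports its gain as a reduction in FPR95.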

85. 【2603.02609】VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

Link: https://arxiv.org/abs/2603.02609

Authors: A. Enes Doruk,Hasan F. Ates

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: semantic occupancy prediction, autonomous driving, prediction in autonomous, occupancy prediction, robust multimodal framework

Comments:

Click to view abstract

Abstract:This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.

86. 【2603.02598】Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation

Link: https://arxiv.org/abs/2603.02598

Authors: Taowen Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study companion devices, AI-powered study companion, collecting large-scale annotated, large-scale annotated datasets, ethically prohibitive due

Comments: 16 pages, 3 figures, 5 tables

Abstract:Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n~300), the FP16 model achieves 71.2 AP -- a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.
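
The abstract does not say which INT8 quantization scheme is used. As a rough illustration of what the FP16-to-INT8 step involves, the sketch below applies symmetric per-tensor quantization, a common baseline that is an assumption here and not the paper's stated method. The function names and the example weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]

# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(codes, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is the kind of small perturbation behind the 0.8 AP drop (71.2 to 70.4) reported in the abstract.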

87. 【2603.02591】Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification

Link: https://arxiv.org/abs/2603.02591

Authors: Rafi Hassan Chowdhury,Naimul Haque,Kaniz Fatiha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision tasks, computer vision, convolutional neural networks, deep convolutional neural, vision tasks

Comments:

Abstract:Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.

88. 【2603.02582】Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction

Link: https://arxiv.org/abs/2603.02582

Authors: Zhe Chen,Peilin Zheng,Wenshuo Chen,Xiucheng Wang,Yutao Yue,Nan Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: Creating functional Digital, Creating functional, computer vision, central challenge, challenge in computer

Comments: 10 pages, 5 figures

Abstract:Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene's underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.

89. 【2603.02581】ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration

Link: https://arxiv.org/abs/2603.02581

Authors: Leheng Zhang,Wei Long,Yawei Li,Xingyu Zhou,Xiaorui Zhao,Shuhang Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant popularity, Transformers have gained, gained significant, significant popularity, image

Comments: 16 pages, 10 figures

Abstract:Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifacts removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
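
To make the "linear complexity" claim concrete, the following is a minimal sketch of cross-attention between N image tokens and a fixed dictionary of M entries, which costs O(N * M * d) and so grows linearly with N. It is a simplification under stated assumptions: the dictionary entries serve as both keys and values, nothing is learned, and the function names and toy vectors are hypothetical. It is not the ATD implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dictionary_cross_attention(tokens, dictionary):
    """Each of the N tokens attends to the M dictionary entries: O(N * M * d) work."""
    d = len(dictionary[0])
    scale = 1.0 / math.sqrt(d)
    outputs, categories = [], []
    for tok in tokens:
        # Scaled dot-product similarity between this token and every dictionary entry.
        logits = [scale * sum(t * k for t, k in zip(tok, entry)) for entry in dictionary]
        weights = softmax(logits)
        # Weighted sum of dictionary entries (used here as both keys and values).
        out = [sum(w * entry[j] for w, entry in zip(weights, dictionary)) for j in range(d)]
        outputs.append(out)
        # The strongest dictionary entry acts as a coarse category label for grouping.
        categories.append(max(range(len(weights)), key=weights.__getitem__))
    return outputs, categories

# Hypothetical 2-D dictionary and tokens, for illustration only.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[2.0, 0.1], [0.1, 3.0], [-1.5, 0.2], [1.8, -0.2]]
outputs, categories = dictionary_cross_attention(tokens, dictionary)
print(categories)
```

Tokens that share a category index (the first and last token in this toy input) would fall into the same attention group, which mirrors the category-based grouping the abstract describes.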

90. 【2603.02573】Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

Link: https://arxiv.org/abs/2603.02573

Authors: Jiahao Lu,Jiayi Xu,Wenbo Hu,Ruijie Zhu,Chengfeng Zhao,Sai-Kit Yeung,Ying Shan,Yuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial and promising, comprehensive understanding, Estimating, tracking, tracking sparse points

Comments: Project Page: [this https URL](https://jiah-cloud.github.io/Track4World.github.io/)

Abstract:Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.

91. 【2603.02561】SOLAR: SVD-Optimized Lifelong Attention for Recommendation

Link: https://arxiv.org/abs/2603.02561

Authors: Chenghao Zhang,Chao Feng,Yuanhao Pu,Xunyong Yang,Wenhui Yu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: global credit assignment, expressive global credit, makes long-context modeling, operator in Transformers, long-context modeling expensive

Comments: 18 pages, 4 figures

92. 【2603.02560】CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration

Link: https://arxiv.org/abs/2603.02560

Authors: Huichun Liu,Xiaosong Li,Zhuangfan Huang,Tao Ye,Yang Liu,Haishu Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Image Fusion, informative fused images, integrates complementary information, Multimodal Image, adverse weather

Comments:

Abstract:Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) to enhance degraded visible features and extract global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at this https URL.

93. 【2603.02557】CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Link: https://arxiv.org/abs/2603.02557

Authors: Maoyuan Shao,Yutong Gao,Xinyang Huang,Chuang Zhu,Lijuan Sun,Guoshun Nan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: cross-modal representation learning, achieved remarkable progress, CLIP have achieved, semantically similar categories, Vision-language models

Comments: Accepted by CVPR2026

Abstract:Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at this https URL.

94. 【2603.02556】Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Link: https://arxiv.org/abs/2603.02556

Authors: Zhiyu Pan,Yizheng Wu,Jiashen Hua,Junyi Feng,Shaotian Yan,Bing Deng,Zhiguo Cao,Jieping Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reasoning, large language models, visual, visual reasoning, reasoning paths

Comments: 19 pages, 9 figures, accepted to ICLR 2026 (oral)

95. 【2603.02554】Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

Link: https://arxiv.org/abs/2603.02554

Authors: Chonghua Lv,Dong Zhao,Shuang Wang,Dou Quan,Ning Huyan,Nicu Sebe,Zhun Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: approaches primarily preserve, primarily preserve in-domain, preserve in-domain accuracy, compress large models, conventional approaches primarily

Comments: Accepted by CVPR2026

Abstract:Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at this https URL.

96. 【2603.02553】Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery

Link: https://arxiv.org/abs/2603.02553

Authors: Xuejin Luo,Shiquan Sun,Runshi Zhang,Ruizhi Zhang,Junchen Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Robotic scrub nurses, frequently deliver surgical, scrub nurses, decreased focus, required to frequently

Comments: 8 pages, 10 figures. Accepted by IEEE International Conference on Robotics and Automation (ICRA), 2026

Abstract:During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at this https URL.

97. 【2603.02548】SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding

Link: https://arxiv.org/abs/2603.02548

Authors: Sheng Ye,Zhen-Hui Dong,Ruoyu Fan,Tian Lv,Yong-Jin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex environments, essential for robots, robots to operate, operate effectively, effectively and safely

Comments: ICRA 2026

Abstract:Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.

98. 【2603.02546】On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

Link: https://arxiv.org/abs/2603.02546

Authors: Zhanzhong Pang,Dibyadip Chatterjee,Fadime Sener,Angela Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Multimodal Large, Language Models, Large Language

Comments: 22 pages, 9 figures, 16 tables. Accepted by ICLR2026

Abstract:Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.

99. 【2603.02541】ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection

Link: https://arxiv.org/abs/2603.02541

Authors: Deokyun Kim,Jeongjun Lee,Jungwon Choi,Jonggeon Park,Giyoung Lee,Yookyung Kim,Myungseok Ki,Juho Lee,Jihun Cha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, imagery typically captured, captured by Unmanned, oblique aerial imagery, aerial imagery typically

Comments: ICLR 2026 Accepted

Abstract:Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at this https URL.

100. 【2603.02533】Functional Properties of the Focal-Entropy

Link: https://arxiv.org/abs/2603.02533

Authors: Jaimin Shah,Martina Cardone,Alex Dytso

Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Keywords: class-imbalanced classification problems, classification problems, computer vision, widely used alternative, class-imbalanced classification

Comments: Accepted to AISTATS 2026

Abstract:The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.
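
As a small numerical illustration of the claim that the minimizer can depart from the data distribution, the sketch below assumes the focal-entropy takes the form H_gamma(P, Q) = -sum_i p_i (1 - q_i)^gamma log q_i, i.e. the expected focal loss under P. That form is inferred from the standard focal-loss definition and is an assumption here; the paper's exact definition may differ. The grid search and the choice p = (0.9, 0.1) are likewise hypothetical.

```python
import math

def focal_entropy(p, q, gamma):
    """Expected focal loss of model q under data distribution p (assumed definition)."""
    return -sum(
        pi * (1.0 - qi) ** gamma * math.log(qi)
        for pi, qi in zip(p, q)
        if pi > 0
    )

def binary_minimizer(p1, gamma, steps=1000):
    """Grid-search the q1 in (0, 1) that minimizes the focal-entropy for p = (p1, 1 - p1)."""
    p = [p1, 1.0 - p1]
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda q1: focal_entropy(p, [q1, 1.0 - q1], gamma))

# gamma = 0 reduces to ordinary cross-entropy, whose minimizer is the data distribution.
q_ce = binary_minimizer(0.9, gamma=0.0)
# gamma = 2 is a commonly used focal-loss setting.
q_focal = binary_minimizer(0.9, gamma=2.0)
print(q_ce, q_focal)
```

In this toy case the gamma = 2 minimizer lands below 0.9, which is consistent with the abstract's statement that the focal-loss suppresses high-probability outcomes.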

101. 【2603.02532】EIMC: Efficient Instance-aware Multi-modal Collaborative Perception

Link: https://arxiv.org/abs/2603.02532

Authors: Kang Yang,Peng Wang,Lantao Li,Tianci Bu,Chen Sun,Deying Li,Yongcai Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: collaborative perception calls, autonomous driving, Multi-modal collaborative perception, perception calls, calls for great

Comments: 9 pages, 8 figures, 7 tables

Abstract:Multi-modal collaborative perception calls for great attention to enhancing the safety of autonomous driving. However, current multi-modal approaches remain a "local fusion to communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at this https URL.
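
The Top-K query step can be pictured with a small sketch. The scoring rule used here, (1 - ego confidence) multiplied by the absolute ego-peer confidence gap, is a hypothetical stand-in for "low-confidence, high-discrepancy"; the abstract does not give the actual rule, and the function name and toy confidences are also assumptions.

```python
import heapq

def select_instances_to_query(ego_conf, peer_conf, k):
    """Return the indices of the k cells where the ego is unsure and the peer disagrees most."""
    scores = []
    for idx, (e, p) in enumerate(zip(ego_conf, peer_conf)):
        uncertainty = 1.0 - e      # low ego confidence -> high uncertainty
        discrepancy = abs(p - e)   # disagreement between ego and peer heatmaps
        scores.append((uncertainty * discrepancy, idx))
    top = heapq.nlargest(k, scores)
    return sorted(idx for _, idx in top)

# Hypothetical flattened confidence heatmaps for five cells.
ego_conf = [0.95, 0.20, 0.60, 0.10, 0.85]
peer_conf = [0.90, 0.90, 0.65, 0.15, 0.30]
queried = select_instances_to_query(ego_conf, peer_conf, k=2)
print(queried)
```

Only the selected cells would be requested from peers, which is the source of the bandwidth saving: an 87.98% reduction means roughly 12% of the baseline bytes are transmitted.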

102. 【2603.02522】NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

Link: https://arxiv.org/abs/2603.02522

Authors: Liang Zeng,Valerio Marsocci,Wufan Zhao,Andrea Nascetti,Maarten Vergauwen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Masked Image Modeling, Earth Observation images, unlabeled Earth Observation, Earth Observation, Image Modeling

Comments:

Abstract:Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.

103. 【2603.02518】Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks

Link: https://arxiv.org/abs/2603.02518

Authors: Syeda Hareem Madani,Noureen Bibi,Adam Rafiq Jeraj,Sumra Khan,Anas Zafar,Rizwan Qureshi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, idiosyncratic connectivity patterns, brain parcellations dominate

Comments: 10 pages

Abstract:Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three-phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.
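
The sample counts in this abstract can be checked with a short sketch: a 70/15/15 split of 400 subjects gives 280 training subjects, and 1,680 is six times 280, which would match keeping each original sample plus five noisy copies. The "five copies" reading, the noise level sigma, the feature length, and the function name are assumptions; the abstract only states the totals.

```python
import random

def augment_with_gaussian_noise(samples, copies_per_sample, sigma, seed=0):
    """Keep each sample and add `copies_per_sample` noisy copies with N(0, sigma^2) noise."""
    rng = random.Random(seed)
    augmented = []
    for features in samples:
        augmented.append(list(features))  # the original, unchanged
        for _ in range(copies_per_sample):
            augmented.append([x + rng.gauss(0.0, sigma) for x in features])
    return augmented

n_subjects = 400
n_train = n_subjects * 70 // 100   # 280
n_val = n_subjects * 15 // 100     # 60
n_test = n_subjects - n_train - n_val

# Placeholder feature vectors; real inputs would be connectivity features per subject.
train = [[0.0] * 4 for _ in range(n_train)]
augmented = augment_with_gaussian_noise(train, copies_per_sample=5, sigma=0.01)
print(n_train, n_val, n_test, len(augmented))
```

Applying the noise only to the training split, as the abstract describes, keeps the validation and test subjects untouched by augmentation.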

104. 【2603.02505】SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data

Link: https://arxiv.org/abs/2603.02505

Authors: Lekang Wen,Liang Liao,Jing Xiao,Mi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sensing Earth observation, remote sensing Earth, Multimodal semantic segmentation, Earth observation, sensing Earth

Comments:

Click to view abstract

Abstract:Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk either over-alignment that discards modality-specific cues or imbalanced training that favors robust modalities, while largely overlooking intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance. SGMA introduces two complementary plug-and-play modules: (1) the Semantic-Guided Fusion (SGF) module, which extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; and (2) the Modality-Aware Sampling (MAS) module, which leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.

105. 【2603.02497】WTHaar-Net: a Hybrid Quantum-Classical Approach

Link: https://arxiv.org/abs/2603.02497

Authors: Vittorio Palladino,Tsai Idden,Ahmet Enis Cetin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Haar Wavelet Transform, neural networks rely, suitable transform domains, Convolutional neural networks, convolutional neural network

Comments: 16 pages, 5 images

Click to view abstract

Abstract:Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
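
For readers unfamiliar with the Haar Wavelet Transform mentioned above, a single level of the orthonormal 1D transform can be sketched in a few lines. This is a generic textbook illustration, not the paper's network layer or quantum circuit; the function names are hypothetical. Each pair of neighbouring values is mapped to a scaled sum and a scaled difference, which is the same 2x2 matrix as the single-qubit Hadamard gate, and that gives some intuition for why a Hadamard-based realization is plausible.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_step_inverse(approx, detail):
    """Invert haar_step by undoing each pairwise sum/difference."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([s * (a + d), s * (a - d)])
    return signal

x = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_step(x)
reconstructed = haar_step_inverse(approx, detail)
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in approx + detail)
print(reconstructed, energy_in, energy_out)
```

The detail coefficient of the flat pair `(5.0, 5.0)` is zero, which shows the spatial localization the abstract contrasts with the Hadamard Transform. Because the step is orthonormal, the signal energy is preserved up to floating-point rounding.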

106. 【2603.02482】MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Link: https://arxiv.org/abs/2603.02482

Authors: Zhongxi Wang,Yueqian Lin,Jingyang Zhang,Hai Helen Li,Yiran Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: remain predominantly text-centric, Unified Safety Evaluation, large language models, language models remain, models remain predominantly

Comments: Submitted to ACL 2026 System Demonstration Track

Click to view abstract

107. 【2603.02481】ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop

Link: https://arxiv.org/abs/2603.02481

Authors: Shuangzhi Li,Lei Ma,Xingyu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrating complementary sensors, autonomous driving, integrating complementary, LiDAR and cameras, pivotal for autonomous

Comments:

Click to view abstract

Abstract:Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing data, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.

108. 【2603.02477】E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition

Link: https://arxiv.org/abs/2603.02477

Authors: Mubarak Olaoluwa,Hassen Drira

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently gained significant, gained significant attention, computer vision community, capture meaningful representations, Geometric deep learning

Comments:

Click to view abstract

Abstract:Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies, and extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods at lower cost.

109. 【2603.02475】Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild

Link: https://arxiv.org/abs/2603.02475

Authors: Vitor Pereira Matias,Márcus Vinícius Lobo Costa,João Batista Neto,Tiago Novello de Brito

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Deep learning, inherit biases, classic computer vision, Deep, Deep learning models

Comments: 12 pages, 11 figures

Click to view abstract

Abstract:Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, use small, private datasets that prevent reproducibility, or rely on classic computer vision pipelines, with only a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches near-annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data will be available soon.

110. 【2603.02465】Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning

Link: https://arxiv.org/abs/2603.02465

Authors: Emadeldeen Hamdan,Ahmad Faiz Tharima,Mohd Zahirasri Mohd Tohir,Dayang Nur Sakinah Musa,Erdem Koyuncu,Adam J. Watts,Ahmet Enis Cetin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: based wildfire detection, peatland fire, wildfire detection methods, Machine learning, peatland fire detection

Comments:

Click to view abstract

Abstract:Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics -- such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning -- that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.

111. 【2603.02438】ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Link: https://arxiv.org/abs/2603.02438

Authors: Aymen Lassoued,Mohamed Ali Souibgui,Yousri Kessentini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Question Answering, Document Visual Question, existing Vision-Language Models, Question Answering, Visual Question

Comments:

Click to view abstract

Abstract:Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.

112. 【2603.02434】MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer's Disease Prediction

Link: https://arxiv.org/abs/2603.02434

Authors: Guanchen Wu,Zhe Huang,Yuzhang Xie,Runze Yan,Akul Chopra,Deqiang Qiu,Xiao Hu,Fei Wang,Carl Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Magnetic Resonance Imaging, Reliable Alzheimer disease, Electronic Health Records, structural Magnetic Resonance, Reliable Alzheimer

Comments:

Click to view abstract

Abstract:Reliable Alzheimer's disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled "diagnostic-surrogate" representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.

113. 【2603.02430】A Unified Revisit of Temperature in Classification-Based Knowledge Distillation

Link: https://arxiv.org/abs/2603.02430

Authors: Logan Frank,Jim Davis

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: expose relational structure, relational structure embedded, central idea, expose relational, relational structure

Comments:

Click to view abstract

Abstract:A central idea of knowledge distillation is to expose relational structure embedded in the teacher's weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding of how to select an appropriate temperature value, or how this value depends on other training elements such as optimizer, teacher pretraining/finetuning, etc. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.
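
To make the role of the temperature parameter concrete, the following sketch shows the commonly used temperature-scaled softmax and a KL-based distillation loss with the usual T² scaling. It is a generic illustration of classification-based distillation, not the experimental setup of this paper; the function names and example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T; larger T gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher_logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(teacher_logits, 1.0)
soft = softmax_with_temperature(teacher_logits, 4.0)
print(sharp)
print(soft)
print(kd_kl_loss(teacher_logits, [0.1, 1.0, 2.0], 4.0))
```

Raising the temperature moves probability mass from the top class to the others, so the student sees more of the teacher's relative ranking among non-target classes. How much softening helps is exactly the question the abstract says depends on the rest of the training setup.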

114. 【2603.02419】DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting

Link: https://arxiv.org/abs/2603.02419

Authors: Rui-Feng Wang,Daniel Petti,Yue Chen,Changying Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Foundation, remain insufficiently understood, Foundation Models trained, large-scale self-supervised learning

Comments: 16 pages, 9 figures, 5 tables

Click to view abstract

Abstract:Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.

115. 【2603.02413】TruckDrive: Long-Range Autonomous Highway Driving Dataset

Link: https://arxiv.org/abs/2603.02413

Authors: Filippo Ghilotti,Edoardo Palladin,Samuel Brucker,Adam Sigal,Mario Bijelic,Felix Heide

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safe braking margins, heavy trucks remains, Safe highway autonomy, long braking distances, safe braking

Comments:

Click to view abstract

Abstract:Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.

116. 【2603.02411】From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness

Link: https://arxiv.org/abs/2603.02411

Authors: My H. Dinh,Aditya Sant,Akshay Malhotra,Keya Patani,Shahab Hamidi-Rad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: compresses large datasets, maintain training performance, compresses large, Dataset Distillation, Quantization-aware Dataset Distillation

Comments: Accepted to CVPR 2026 - Findings Workshop

Click to view abstract

Abstract:Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
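
The trade-off between sample count and precision under a fixed bit budget comes down to simple arithmetic: total bits = samples x values per sample x bits per value. The sketch below works through that arithmetic and adds a plain (non-differentiable) uniform quantizer. It is only an illustration under assumed numbers, namely 32x32x3 images and a budget equal to ten 32-bit images; it is not the QuADD quantization module, and all names are hypothetical.

```python
def samples_within_budget(bit_budget, values_per_sample, bits_per_value):
    """How many synthetic samples fit into a fixed bit budget."""
    return bit_budget // (values_per_sample * bits_per_value)

def uniform_quantize(values, bits, lo=0.0, hi=1.0):
    """Snap each value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    quantized = []
    for v in values:
        clipped = min(max(v, lo), hi)
        quantized.append(lo + round((clipped - lo) / step) * step)
    return quantized

values_per_image = 32 * 32 * 3             # assumed image size
budget = 10 * values_per_image * 32        # assumed budget: ten 32-bit images
at_8_bits = samples_within_budget(budget, values_per_image, 8)
at_4_bits = samples_within_budget(budget, values_per_image, 4)
print(at_8_bits, at_4_bits)                # 40 80
print(uniform_quantize([0.0, 0.49, 1.0], bits=1))
```

Halving the precision doubles the number of samples that fit in the same budget, which is why the abstract treats the allocation between the two as an optimization problem.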

117. 【2603.02390】OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments

Link: https://arxiv.org/abs/2603.02390

Authors: Hymalai Bello,Lala Ray,Joanna Sorysz,Sungho Suh,Paul Lukowicz

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: Smart factories, factories use advanced, advanced technologies, technologies to optimize, optimize production

Comments: Accepted in CVPR 2026

Click to view abstract

Abstract:Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.

118. 【2603.02386】Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial

Link: https://arxiv.org/abs/2603.02386

Authors: Caleb Robinson,Nils Lehmann,Adam J. Stewart,Burak Ekim,Heng Fang,Isaac A. Corley,Mauricio Cordeiro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision workflows, standard computer vision, pipelines differ fundamentally, Earth observation machine, machine learning pipelines

Comments: Accepted at ICLR ML4RS 2026 Tutorial Track

Click to view abstract

Abstract:Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates 1.) the core TorchGeo abstractions through code examples, and 2.) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: this https URL and this https URL.

119. 【2603.02378】Authenticated Contradictions from Desynchronized Provenance and Watermarking

Link: https://arxiv.org/abs/2603.02378

Authors: Alexander Nemecek,Hengzhi He,Guang Cheng,Erman Ayday

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Keywords: content authentication, invisible watermarking, watermarking are positioned, positioned as complementary, complementary defenses

Comments: 11 pages

Click to view abstract

Abstract:Cryptographic provenance standards such as C2PA and invisible watermarking are positioned as complementary defenses for content authentication, yet the two verification layers are technically independent: neither conditions on the output of the other. This work formalizes and empirically demonstrates the $\textit{Integrity Clash}$, a condition in which a digital asset carries a cryptographically valid C2PA manifest asserting human authorship while its pixels simultaneously carry a watermark identifying it as AI-generated, with both signals passing their respective verification checks in isolation. We construct metadata washing workflows that produce these authenticated fakes through standard editing pipelines, requiring no cryptographic compromise, only the semantic omission of a single assertion field permitted by the current C2PA specification. To close this gap, we propose a cross-layer audit protocol that jointly evaluates provenance metadata and watermark detection status, achieving 100% classification accuracy across 3,500 test images spanning four conflict-matrix states and three realistic perturbation conditions. Our results demonstrate that the gap between these verification layers is unnecessary and technically straightforward to close.

120. 【2603.02371】Aligning Fetal Anatomy with Kinematic Tree Log-Euclidean PolyRigid Transforms

Link: https://arxiv.org/abs/2603.02371

Authors: Yingcheng Liu,Athena Taymourtash,Yang Liu,Esra Abaci Turk,William M. Wells,Leo Joskowicz,P. Ellen Grant,Polina Golland

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Automated analysis, Automated, Skinned Multi-Person Linear, Kinematic Tree-based Log-Euclidean, medical imaging

Comments:

Click to view abstract

Abstract:Automated analysis of articulated bodies is crucial in medical imaging. Existing surface-based models often ignore internal volumetric structures and rely on deformation methods that lack anatomical consistency guarantees. To address this problem, we introduce a differentiable volumetric body model based on the Skinned Multi-Person Linear (SMPL) formulation, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform. KTPolyRigid resolves Lie algebra ambiguities associated with large, non-local articulated motions, and encourages smooth, bijective volumetric mappings. Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts. Furthermore, our framework enables robust groupwise image registration and a label-efficient, template-based segmentation of fetal organs. It provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging.

121. 【2603.02370】Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples

Link: https://arxiv.org/abs/2603.02370

Authors: Phillip Howard,Xin Su,Kathleen C. Fraser

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, grown increasingly powerful, Large Vision-Language, exhibit harmful biases, cultural

Comments:

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual's appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.

122. 【2603.02367】Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment

Link: https://arxiv.org/abs/2603.02367

Authors: Yaxi Chen,Simin Ni,Jingjing Zhang,Shaheer U. Saeed,Yipei Wang,Aleksandra Ivanova,Rikin Hargunani,Chaozong Liu,Jie Huang,Yipeng Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantify image appearance, Classical radiomic features, Classical radiomic, intensity patterns, designed to quantify

Comments:

Click to view abstract

Abstract:Classical radiomic features are designed to quantify image appearance and intensity patterns. Compared with end-to-end deep learning (DL) models trained for disease classification, radiomics pipelines with low-dimensional parametric classifiers offer enhanced transparency and interpretability, yet often underperform because of the reliance on population-level predefined feature sets. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from a large radiomic pool F (often ~10^3). However, such marginal ranking can over-admit redundant descriptors and overlook complementary feature interactions. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject, targeting complementary and diverse evidence rather than marginal top-k features. To overcome the intractable combinatorial search space of F choose k features, our method utilizes a two-stage retrieval strategy: randomly sample diverse candidate feature sets, then rank these sets with a learned scoring function to select a high-performing feature set for the specific patient. The system consists of a feature-set scorer and a classifier that performs the final diagnosis. We empirically show that the proposed two-stage retrieval approximates the original exhaustive all k-feature selection. Validated on tasks including ACL tear detection and KL grading for osteoarthritis, the method outperforms the top-k approach with the same k values and is competitive with end-to-end DL models while maintaining high transparency. The model generates auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, allowing clinicians to inspect which anatomical structures and quantitative descriptors drive the prediction.
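
The sample-then-rank idea described in this abstract can be sketched as follows. The pool size of 1,000 matches the "~10^3" figure quoted above, but the choice k = 10, the number of candidates, and especially `toy_score` (a made-up stand-in for the learned, patient-specific scoring function) are assumptions for illustration only.

```python
import math
import random

def sample_candidate_sets(pool_size, k, num_candidates, seed=0):
    """Stage 1: draw random k-feature subsets from a pool of feature indices."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(pool_size), k)))
            for _ in range(num_candidates)]

def retrieve_feature_set(candidates, score_fn):
    """Stage 2: rank the candidates with a scoring function and keep the best."""
    return max(candidates, key=score_fn)

def toy_score(feature_set):
    # Hypothetical scorer: prefers sets whose indices are spread out.
    # A real system would use a learned, patient-conditioned model here.
    return min(b - a for a, b in zip(feature_set, feature_set[1:]))

pool_size, k = 1000, 10
exhaustive = math.comb(pool_size, k)   # size of the full search space
candidates = sample_candidate_sets(pool_size, k, num_candidates=256)
best = retrieve_feature_set(candidates, toy_score)
print(exhaustive)
print(best)
```

With these assumed sizes the exhaustive search space has more than 10^23 subsets, while the two-stage procedure scores only 256 of them, which is the motivation for retrieval over enumeration.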

123. 【2603.02363】Beyond Caption-Based Queries for Video Moment Retrieval

Link: https://arxiv.org/abs/2603.02363

Authors: David Pujol-Perich,Albert Clapés,Dima Damen,Sergio Escalera,Michael Wray

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing VMR methods, VMR methods, DETR architectures, existing VMR, public VMR datasets

Comments: CVPR 2026 Camera-ready version

Abstract: In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets -- i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage: this https URL

124. 【2603.02351】MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Link: https://arxiv.org/abs/2603.02351

Authors: Leo Kaixuan Cheng,Abdus Shaikh,Ruofan Liang,Zhijie Wu,Yushi Guan,Nandita Vijaykumar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, achieved impressive accuracy, achieved impressive, Recent, including transformer-based models

Comments: Project page: [this https URL](https://leochengkx.github.io/MERG3R/)

Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks and Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.

125. 【2603.02337】Preconditioned Score and Flow Matching

Link: https://arxiv.org/abs/2603.02337

Authors: Shadab Ahamed,Eshed Gal,Simon Ghyselincks,Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: score-based diffusion train, diffusion train vector, train vector fields, intermediate distributions, score-based diffusion

Comments: 24 pages, 12 figures, 5 tables

Abstract: Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias: when $\Sigma_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional preconditioning maps that reshape the geometry of $p_t$ by improving the conditioning of $\Sigma_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
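The optimization bias described here has a well-known analogue in plain gradient descent on a quadratic loss, which may help build intuition. The sketch below is a toy under stated assumptions, not the paper's setting: it assumes a diagonal covariance, a loss of the form 0.5 * sum_i s_i * e_i^2 (where s_i is the variance along direction i and e_i the weight error), and a step size tied to the largest variance. The function name `remaining_error` is hypothetical.

```python
def remaining_error(variances, steps, lr_scale=0.5):
    """Gradient descent on 0.5 * sum_i s_i * e_i**2, starting from e_i = 1."""
    # A stable step size is limited by the largest variance (largest curvature).
    lr = lr_scale / max(variances)
    errors = [1.0] * len(variances)
    for _ in range(steps):
        # The gradient along direction i is s_i * e_i, so e_i shrinks by (1 - lr * s_i).
        errors = [(1.0 - lr * s) * e for s, e in zip(variances, errors)]
    return errors


# Ill-conditioned case: one high-variance and one low-variance direction.
raw = remaining_error([100.0, 0.01], steps=1000)
# After an idealized rescaling to unit variance in every direction.
pre = remaining_error([1.0, 1.0], steps=1000)
print(raw, pre)
```

In the ill-conditioned run the high-variance direction converges almost immediately while the low-variance direction has barely moved after 1000 steps; with equalized variances both directions converge at the same rate.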

126. 【2603.02329】HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Link: https://arxiv.org/abs/2603.02329

Authors: Lei Yao,Yong Chen,Yuejiao Su,Yi Wang,Moyun Liu,Lap-Pui Chau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans commonly identify, Humans commonly, commonly identify, generically generalized, observed interactions

Comments: Accepted by CVPR 2026. Project Page: [this https URL](https://rayyoh.github.io/Hammer)

Abstract: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generalized to novel objects. Inspired by this principle, we advocate for a novel framework, namely HAMMER, that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly mines object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.

127. 【2603.02288】AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning

Link: https://arxiv.org/abs/2603.02288

Authors: Paul Friedrich,Florentin Bieder,Florian M. Thieringer,Philippe C. Cattin

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Facial feminization surgery, reshape craniofacial structures, Facial feminization, gender diverse patients, feminization surgery

Comments: Code: [this https URL](https://github.com/pfriedri/autoffs)

Abstract:Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation and a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.

128. 【2603.02286】Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

Link: https://arxiv.org/abs/2603.02286

Authors: Yaoteng Zhang,Zhou Qing,Junyu Gao,Qi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: forgetting previously learned, Incremental Object Detection, Object Detection, aims to continuously, categories without forgetting

Comments: Our paper has been accepted to CVPR 2026

Abstract: Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP adopts a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG dynamically updates the class prototype space during training and uses the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: this https URL_IOD/tree/main

129. 【2603.02270】From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification

Link: https://arxiv.org/abs/2603.02270

Authors: Vasiliy Kudryavtsev,Kirill Borodin,German Berezin,Kirill Bubenchikov,Grach Mkrtchian,Alexander Ryzhkov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated animal identification, limited dataset scale, reuniting lost pets, unimodal visual cues, Automated animal

Comments: Published in MDPI Journal of Imaging (see [this https URL](https://www.mdpi.com/2313-433X/12/1/30))

Abstract: Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
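For readers unfamiliar with the Equal Error Rate reported above, the following sketch shows one common way to approximate it from verification scores: sweep a threshold and find the point where the false accept rate and false reject rate are closest. The function name `equal_error_rate` and the example scores are made up for illustration; the paper's exact evaluation protocol is not described in the abstract.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping thresholds over all observed scores."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        # A pair is accepted when its similarity score is >= t.
        far = sum(s >= t for s in impostor) / len(impostor)  # false accept rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Made-up similarity scores for same-animal and different-animal pairs.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer_value = equal_error_rate(genuine_scores, impostor_scores)
print(eer_value)
```

Lower is better: an EER of 0 means some threshold separates matching and non-matching pairs perfectly.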

130. 【2603.02263】Social-JEPA: Emergent Geometric Isomorphism

Link: https://arxiv.org/abs/2603.02263

Authors: Haoran Zhang,Youjin Wang,Yi Duan,Rong Fu,Dianyu Zhao,Sicheng Fan,Shuaishuai Cao,Wentao Guo,Xiao Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: World models compress, anticipate future observations, compress rich sensory, rich sensory streams, models compress rich

Abstract:World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at this https URL.

131. 【2603.02256】CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Link: https://arxiv.org/abs/2603.02256

Authors: Zhihao Shi,Kejia Yin,Weilin Wan,Yuhongze Zhou,Yuanhao Yu,Xinxin Zuo,Qiang Sun,Juwei Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: upgrading amateur footage, plausibly inpainting previously, inpainting previously unseen, previously unseen regions, professionally styled videos

Abstract: Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework with two components. (1) It explicitly aggregates information across the entire source video via a hybrid warping scheme: static regions are progressively fused into a world cache and then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. (2) It processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

132. 【2603.02220】Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

Link: https://arxiv.org/abs/2603.02220

Authors: Yixin Wang,Yifan Hu,Peiyuan Liu,Naiqi Li,Dai Tao,Shu-Tao Xia

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Time series forecasting, challenging problem due, Time series, remains a challenging, intraperiod-fluctuations and interperiod-trends

Abstract: Time series forecasting (TSF) remains a challenging problem due to the intricate entanglement of intraperiod-fluctuations and interperiod-trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a continuous latent surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance.

133. 【2512.03101】ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification

Link: https://arxiv.org/abs/2512.03101

Authors: Congjing Zhang,Feng Lin,Xinyi Zhao,Pei Guo,Wei Li,Lin Chen,Chaoyue Zhao,Shuai Huang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, developing multi-modal LLM, visual anomaly detection, greatly stimulated research, stimulated research interest

Abstract: The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, anomalies are sometimes highly contextual and ambiguous, so uncertainty quantification (UQ) is a crucial capability for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using real-world smart-home benchmark data and wound image classification data, which show ALARM's superior performance and its generic applicability across different domains for reliable decision-making.

134. 【2603.02499】Biomechanically Accurate Gait Analysis: A 3d Human Reconstruction Framework for Markerless Estimation of Gait Parameters

Link: https://arxiv.org/abs/2603.02499

Authors: Akila Pemasiri,Ethan Goan,Glen Lichtwark,Robert Schuster,Luke Kelly,Clinton Fookes

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: human reconstruction, paper presents, reconstruction from video, video data, Abstract

Abstract: This paper presents a biomechanically interpretable framework for gait analysis using 3D human reconstruction from video data. Unlike conventional keypoint-based approaches, the proposed method extracts biomechanically meaningful markers analogous to motion capture systems and integrates them within OpenSim for joint kinematic estimation. To evaluate performance, both spatiotemporal and kinematic gait parameters were analysed against reference marker-based data. Results indicate strong agreement with marker-based measurements, with considerable improvements when compared with pose-estimation methods alone. The proposed framework offers a scalable, markerless, and interpretable approach for accurate gait assessment, supporting broader clinical and real-world deployment of vision-based biomechanics.

135. 【2603.02483】Geometric structures and deviations on James' symmetric positive-definite matrix bicone domain

Link: https://arxiv.org/abs/2603.02483

Authors: Jacek Karwowski,Frank Nielsen

Categories: Machine Learning (stat.ML); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: including signal processing, matrix datasets play, numerous scientific disciplines, Symmetric positive-definite, computer vision

Comments: 35 pages, 4 figures

Abstract:Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James' bicone reparameterization of the SPD domain. Those structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance which found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.

136. 【2603.02294】Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification

Link: https://arxiv.org/abs/2603.02294

Authors: Nikhileswara Rao Sulake

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-label chest X-ray, Long-tailed class distributions, clinically important findings, chest X-ray, class distributions pose

Comments: This paper is part of the CXR Long Tail Challenge at ISBI 2026. It is my team's report of its work during the challenge

Abstract: Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. In this work, we present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest. Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst all 68 participating teams with a total of 1,528 submissions. We provide a candid analysis of the development-to-test performance gap and discuss practical insights for handling class imbalance in clinical imaging settings. Code is available at this https URL.