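This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

582 papers were updated today, including:

Natural language processing: 67 papers
Information retrieval: 15 papers
Computer vision: 172 papers

Natural Language Processing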

1. 【2603.25737】Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Link: https://arxiv.org/abs/2603.25737

Authors: Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: retrieval-augmented generation, system is typically, irrelevant content, knowledge base, typically assembled

Comments: 15 pages

Abstract:The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.

2. 【2603.25723】Natural-Language Agent Harnesses

Link: https://arxiv.org/abs/2603.25723

Authors: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: performance increasingly depends, Agent performance increasingly, runtime-specific conventions, making it hard, hard to transfer

Comments: under review

Abstract: Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

3. 【2603.25702】S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Link: https://arxiv.org/abs/2603.25702

Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

Categories: Computation and Language (cs.CL)

Keywords: combining block-wise autoregressive, generation by combining, offer a promising, promising path, combining block-wise

Comments: Code is available at [this https URL](https://github.com/phymhan/S2D2)

Abstract: Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.

4. 【2603.25681】Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Link: https://arxiv.org/abs/2603.25681

Authors: Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, Jiawei Zhou

Categories: Computation and Language (cs.CL)

Keywords: continue to advance, improving them solely, limited in scalability, increasingly costly, costly and limited

Abstract:As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

5. 【2603.25674】Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Link: https://arxiv.org/abs/2603.25674

Authors: Cole Walsh, Rodica Ivan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: educational testing industry, scoring, automated scoring systems, automated scoring, widely adopted

Comments: A shortened version of this paper was accepted to AIED 2026; experiment 3 was omitted from the accepted paper due to space restrictions

Abstract: Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to those of trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on "hallucinations" and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.

6. 【2603.25640】RenoBench: A Citation Parsing Benchmark

Link: https://arxiv.org/abs/2603.25640

Authors: Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis

Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)

Keywords: machine-readable scholarly infrastructure, Accurate parsing, scholarly infrastructure, Public Knowledge Project, machine-readable scholarly

Abstract:Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.
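As a rough illustration of what "field-level precision and recall" can mean for citation parsing, the sketch below scores each field separately over paired predicted and gold parses. The exact-match criterion, the field names, and the sample citations are assumptions made for this example; the abstract does not say how RenoBench matches fields.

```python
from collections import Counter

def field_precision_recall(predicted, gold):
    """Return {field: (precision, recall)} over paired citation parses.

    A predicted field counts as correct only if it exactly equals the
    gold value (an assumption; real benchmarks may normalize or fuzzy-match).
    """
    correct, n_predicted, n_gold = Counter(), Counter(), Counter()
    for pred, ref in zip(predicted, gold):
        for field, value in pred.items():
            n_predicted[field] += 1
            if ref.get(field) == value:
                correct[field] += 1
        for field in ref:
            n_gold[field] += 1
    scores = {}
    for field in set(n_predicted) | set(n_gold):
        precision = correct[field] / n_predicted[field] if n_predicted[field] else 0.0
        recall = correct[field] / n_gold[field] if n_gold[field] else 0.0
        scores[field] = (precision, recall)
    return scores

# Hypothetical gold annotations and parser output for two citations.
gold = [
    {"author": "Smith, J.", "year": "2020", "title": "On Parsing"},
    {"author": "Lee, K.", "year": "2019", "title": "Citations"},
]
predicted = [
    {"author": "Smith, J.", "year": "2021", "title": "On Parsing"},
    {"author": "Lee, K.", "year": "2019"},
]
scores = field_precision_recall(predicted, gold)
```

Here the parser gets one year wrong, which lowers both precision and recall for that field, and omits one title, which lowers only title recall.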

7. 【2603.25638】Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Link: https://arxiv.org/abs/2603.25638

Authors: Mingmeng Geng, Yuhang Dong, Thierry Poibeau

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)

Keywords: received sufficient attention, previously received sufficient, large language models, increased frequency, decreased frequency

Comments: Visualization of word usage patterns in arXiv abstracts: [this https URL](https://llm-impact.github.io/word-usage-arxiv-abstract/)

Abstract:Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
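The kind of word-usage measurement described here can be sketched as the share of titles that contain a given word. This is only a minimal illustration, not the authors' method: the helper name `word_share` is hypothetical, and the sample is four titles taken from this listing, far too small to support any conclusion.

```python
import re

def word_share(titles, words):
    """Fraction of titles containing each target word (case-insensitive)."""
    if not titles:
        return {w: 0.0 for w in words}
    hits = {w: 0 for w in words}
    for title in titles:
        tokens = set(re.findall(r"[a-z]+", title.lower()))
        for w in words:
            if w in tokens:
                hits[w] += 1
    return {w: hits[w] / len(titles) for w in words}

titles = [
    "Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers",
    "Beyond Detection: Rethinking Education in the Age of AI-writing",
    "S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation",
    "RenoBench: A Citation Parsing Benchmark",
]
shares = word_share(titles, ["beyond", "via"])
```

Tracking such shares per month or per year is one simple way to see a shift in usage over time.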

8. 【2603.25620】PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Link: https://arxiv.org/abs/2603.25620

Authors: Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi

Categories: Computation and Language (cs.CL)

Keywords: Large language model, Large language, language model, diverse domains, based persona agents

Comments: 20 pages, 6 figures

Abstract:Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: this https URL

9. 【2603.25562】Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Link: https://arxiv.org/abs/2603.25562

Authors: Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language model, On-policy distillation, fixed teacher traces, evaluates teacher feedback, language model

Abstract:On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.

10. 【2603.25537】Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Link: https://arxiv.org/abs/2603.25537

Authors: Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga

Categories: Computation and Language (cs.CL)

Keywords: Visual Writing Prompts, Writing Prompts corpus, Visual Writing, Writing Prompts, Prompts corpus

Comments: 9 pages of content, 1 page of appendices, 9 tables, 3 figures

Abstract:We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at this https URL.

11. 【2603.25531】Synchronous Signal Temporal Logic for Decidable Verification of Cyber-Physical Systems

Link: https://arxiv.org/abs/2603.25531

Authors: Partha Roop, Sobhan Chatterjee, Avinash Malik, Nathan Allen, Logan Kenwright

Categories: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)

Keywords: Cyber Physical System, Physical System, Cyber Physical, Signal Temporal Logic, Temporal Logic

Abstract: Many cyber-physical systems (CPS) work in safety-critical environments, where correct execution, reliability and trustworthiness are essential. Signal Temporal Logic (STL) provides a formal framework for checking safety-critical CPS. However, static verification of STL is undecidable in general, except when we want to verify using run-time-based methods, which have limitations. We propose Synchronous Signal Temporal Logic (SSTL), a decidable fragment of STL, which admits static safety and liveness property verification. In SSTL, we assume that a signal is sampled at fixed discrete steps, called ticks, and then propose a hypothesis, called the Signal Invariance Hypothesis (SIH), which is inspired by a similar hypothesis for synchronous programs. We define the syntax and semantics of SSTL and show that SIH is a necessary and sufficient condition for equivalence between an STL formula and its SSTL counterpart. By translating SSTL to LTL_P (LTL defined over predicates), we enable decidable model checking using the SPIN model checker. We demonstrate the approach on a 33-node human heart model and other case studies.

12. 【2603.25501】An Experimental Comparison of the Most Popular Approaches to Fake News Detection

Link: https://arxiv.org/abs/2603.25501

Authors: Pietro Dell'Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro

Categories: Computation and Language (cs.CL)

Keywords: received increasing attention, Large Language Models, recent years, scientific research, received increasing

Abstract:In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into "Real" and "Fake" to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.
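Label harmonization of the kind described here can be pictured as a per-dataset lookup table that collapses each native label into "Real" or "Fake". The dataset names, label vocabularies, and cut-off between the two classes below are all hypothetical; the abstract does not list the actual mappings.

```python
# Hypothetical label vocabularies; the paper's datasets and mappings are not given here.
HARMONIZATION = {
    "binary_news": {"real": "Real", "fake": "Fake"},
    "six_way_claims": {
        "true": "Real", "mostly-true": "Real", "half-true": "Real",
        "barely-true": "Fake", "false": "Fake", "pants-fire": "Fake",
    },
}

def harmonize(dataset, label):
    """Map a dataset-specific label onto the shared binary scheme."""
    try:
        return HARMONIZATION[dataset][label.strip().lower()]
    except KeyError:
        raise ValueError(f"no mapping for {label!r} in {dataset!r}") from None

examples = [
    ("binary_news", "FAKE"),
    ("six_way_claims", "half-true"),
    ("six_way_claims", "pants-fire"),
]
harmonized = [harmonize(dataset, label) for dataset, label in examples]
```

The six-way example shows the loss of nuance the authors acknowledge: "half-true" and "true" become indistinguishable once both map to "Real".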

13. 【2603.25489】Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Link: https://arxiv.org/abs/2603.25489

Authors: Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich

Categories: Computation and Language (cs.CL)

Keywords: Recent strategies, machine translation rely, low-resource machine translation, strategies for low-resource, low-resource machine

Comments: Preprint

Abstract:Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

14. 【2603.25422】Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Link: https://arxiv.org/abs/2603.25422

Authors: Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Models, Large Language, existing computational methods, rival existing computational, Recent developments

Abstract:Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with a wide variance in performance in current tests, we move to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context only tend to yield marginal performance increases thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch size, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.

15. 【2603.25419】TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Link: https://arxiv.org/abs/2603.25419

Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang

Categories: Computation and Language (cs.CL)

Keywords: demonstrated remarkable proficiency, significant performance disparity, performance disparity persists, Large Language Models, Large Language

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
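For readers unfamiliar with GRPO, its core step is a group-relative advantage: each sampled response's reward is normalized by the mean and standard deviation of its sampling group. The sketch below shows that normalization and, as one possible reading of "decoupling understanding from reasoning", applies it separately to translation rewards and reasoning rewards. The separate normalization, the reward values, and the function name are assumptions; the abstract does not specify TAPO's step-level mechanism.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for a group of four sampled responses to one prompt.
translation_rewards = [0.9, 0.5, 0.7, 0.3]  # quality of the "understand" step
reasoning_rewards = [1.0, 0.0, 1.0, 0.0]    # correctness of the final answer

# Assumed decoupling: each step is normalized within its own reward group,
# so a poor translation score does not rescale the reasoning signal.
translation_adv = group_relative_advantages(translation_rewards)
reasoning_adv = group_relative_advantages(reasoning_rewards)
```

Because each list is centered on its own mean, responses are only compared with their siblings on the same step, which is the property a step-level scheme would need to avoid mixing the two reward scales.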

16. 【2603.25374】Supercharging Federated Intelligence Retrieval

Link: https://arxiv.org/abs/2603.25374

Authors: Dimitris Stripelis, Patrick Foley, Mohammad Naseri, William Lindskog-Münzing, Chong Shen Ng, Daniel Janes Beutel, Nicholas D. Lane

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: typically assumes centralized, assumes centralized access, RAG typically assumes, private data silos, access to documents

Comments: 6 pages, 1 figure, 2 tables

Abstract:RAG typically assumes centralized access to documents, which breaks down when knowledge is distributed across private data silos. We propose a secure Federated RAG system built using Flower that performs local silo retrieval, while server-side aggregation and text generation run inside an attested, confidential compute environment, enabling confidential remote LLM inference even in the presence of honest-but-curious or compromised servers. We also propose a cascading inference approach that incorporates a non-confidential third-party model (e.g., Amazon Nova) as auxiliary context without weakening confidentiality.

17. 【2603.25340】Large Language Model as Token Compressor and Decompressor

Link: https://arxiv.org/abs/2603.25340

Authors: Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang

Categories: Computation and Language (cs.CL)

Keywords: excellent token compressor, compressor and decompressor, LLM can function, variable-length latent codes, pretrained LLM

Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.

18. 【2603.25333】Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Link: https://arxiv.org/abs/2603.25333

Authors: Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, segmented into smaller, indexing and retrieval, effectiveness of Retrieval-Augmented, highly dependent

Comments: Accepted at LREC 2026. 10 pages, 4 figures. Code: [this https URL](https://github.com/ekimetrics/adaptive-chunking)

Abstract: The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework improves RAG outcomes, raising answer correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at [this https URL](https://github.com/ekimetrics/adaptive-chunking).
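The idea of metric-guided chunker selection can be sketched as: run every candidate chunker on a document, score the resulting chunks with intrinsic metrics, and keep the best-scoring chunker. Everything below is a toy stand-in: the two chunkers, the two metric proxies (loosely named after Size Compliance and Block Integrity), the thresholds, and the unweighted average are assumptions, not the paper's definitions.

```python
def split_fixed(text, size=40):
    """Naive fixed-width chunker."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_paragraphs(text):
    """Chunker that splits on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def size_compliance(chunks, lo=20, hi=200):
    """Toy proxy: share of chunks whose length falls within [lo, hi]."""
    return sum(lo <= len(c) <= hi for c in chunks) / len(chunks)

def block_integrity(chunks):
    """Toy proxy: share of chunks that end at a sentence boundary."""
    return sum(c.rstrip().endswith((".", "?", "!")) for c in chunks) / len(chunks)

def choose_chunker(text, chunkers, metrics):
    """Return (name, score) of the chunker with the highest mean metric."""
    best_name, best_score = None, -1.0
    for name, chunker in chunkers.items():
        chunks = chunker(text)
        if not chunks:
            continue
        score = sum(metric(chunks) for metric in metrics) / len(metrics)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

document = (
    "Chunking splits documents into units. Retrieval then indexes each unit."
    "\n\n"
    "Adaptive selection scores every candidate chunker on the document itself."
)
chunkers = {"fixed": split_fixed, "paragraph": split_paragraphs}
best_name, best_score = choose_chunker(document, chunkers, [size_compliance, block_integrity])

# The "over 30%" figure quoted in the abstract follows from 65 vs. 49 answered questions.
relative_gain = (65 - 49) / 49
```

On this sample the paragraph splitter wins because the fixed-width splitter cuts sentences mid-word; `relative_gain` evaluates to roughly 0.327, consistent with the reported increase of over 30%.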

19. 【2603.25329】Beyond Detection: Rethinking Education in the Age of AI-writing

Link: https://arxiv.org/abs/2603.25329

Authors: Maria Marina, Alexander Panchenko, Vasily Konovalov

Categories: Computation and Language (cs.CL)

Keywords: ChatGPT enter classrooms, workplaces and everyday, everyday thinking, automated and stripped, generative AI tools

Comments: 8 pages, AIED 2025

Abstract: As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality -- outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing -- messy, slow, often frustrating -- is where deep human learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.

20. 【2603.25309】Separate Before You Compress: The WWHO Tokenization Architecture

Link: https://arxiv.org/abs/2603.25309

Authors: Kusal Darshana

Categories: Computation and Language (cs.CL)

Keywords: Current Large Language, Large Language Models, simple structured Latin, Byte Pair Encoding, Current Large

Comments: 17 pages, 1 figure, 8 tables. Tokenization Architecture including formal DFA definitions and regular expressions for Sinhala and Devanagari syllabification. Evaluation includes comparisons with OpenAI o200k-base, Llama-4-Scout, and DeepSeek-V3. Source code and datasets: [this https URL](https://github.com/remeinium/WWHO)

Abstract:Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
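To make the reported figures easier to relate to one another, the sketch below works through the arithmetic behind a Token to Word Ratio (TWR) and a percentage token reduction. It assumes that both tokenizers are measured on the same text (so the word count cancels) and that a reduction of r means the new tokenizer uses a fraction 1 - r of the baseline's tokens; the function names are hypothetical.

```python
def token_to_word_ratio(n_tokens, n_words):
    """TWR: average number of tokens spent per word."""
    return n_tokens / n_words

def implied_baseline_twr(twr, reduction):
    """Baseline TWR implied by a fractional token reduction on the same text."""
    return twr / (1.0 - reduction)

def context_multiplier(reduction):
    """How many times more text fits into a fixed token budget."""
    return 1.0 / (1.0 - reduction)

# TWR and reduction figures quoted in the abstract (reductions are vs. o200k base).
sinhala_baseline = implied_baseline_twr(1.274, 0.617)
hindi_baseline = implied_baseline_twr(1.181, 0.270)
sinhala_multiplier = context_multiplier(0.617)
```

Under these assumptions the Sinhala numbers imply a baseline of roughly 3.3 tokens per word, and a 61.7 percent reduction alone corresponds to fitting about 2.6 times as much text into the same window.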

21. 【2603.25293】DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers

Link: https://arxiv.org/abs/2603.25293

Authors: Shu Wan,Saketh Vishnubhatla,Iskander Kushbay,Tom Heffernan,Aaron Belikoff,Raha Moraffah,Huan Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Directed Acyclic Graphs, Directed Acyclic, Acyclic Graphs, represent structured knowledge, DAG

Comments:

Click to view abstract

Abstract:Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.

22. 【2603.25269】When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Link: https://arxiv.org/abs/2603.25269

Authors: Nicolás Benjamín Ocampo,Tommaso Caselli,Davide Ceolin

Categories: Computation and Language (cs.CL)

Keywords: Hateful content online, necessarily correct information, coordinated online harassment, online harassment campaigns, Hateful content

Comments:

Click to view abstract

Abstract:Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims show significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection up to 0.213 macro-F1 and to 0.154 macro-F1 on average for large models.

23. 【2603.25268】CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Link: https://arxiv.org/abs/2603.25268

Authors: Abhijnan Nath,Hannah VanderHoeven,Nikhil Krishnaswamy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: strict partial information, introduce CRAFT, partial information, evaluating pragmatic communication, benchmark for evaluating

Comments:

Click to view abstract

Abstract:We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at this https URL

24. 【2603.25253】MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Link: https://arxiv.org/abs/2603.25253

Authors: Taolin Han,Shuang Wu,Jinghang Wang,Yuhao Zhou,Renquan Lv,Bing Zhao,Wei Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: hold considerable potential, Large language models, Large language, single-turn Question Answering, advancing scientific discovery

Comments:

Click to view abstract

Abstract:Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.

25. 【2603.25227】Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Link: https://arxiv.org/abs/2603.25227

Authors: Giuseppe Samo,Paola Merlo

Categories: Computation and Language (cs.CL)

Keywords: French and Italian, Blackbird Language Matrices, evaluating large language, passive verb alternation, alternation in French

Comments: 13 pages, 8 figures, paper accepted at the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Click to view abstract

Abstract:This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured set ups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.

26. 【2603.25226】WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

Link: https://arxiv.org/abs/2603.25226

Authors: Fanheng Kong,Jingyuan Zhang,Yang Yue,Chenxi Sun,Yang Tian,Shi Feng,Xiaocui Yang,Daling Wang,Yu Tian,Jun Du,Wenchong Zeng,Han Li,Kun Gai

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Large Language Models, natural language instructions, Language Models, build complete projects, Large Language

Comments: 24 pages, code: [this https URL](https://github.com/friedrichor/WebTestBench)

Click to view abstract

Abstract:The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at this https URL.

27. 【2603.25222】Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Link: https://arxiv.org/abs/2603.25222

Authors: Danlu Chen,Ka Sing He,Jiahe Tian,Chenghao Xiao,Zhaofeng Wu,Taylor Berg-Kirkpatrick,Freda Shi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: low-resource machine translation, extremely low-resource machine, language pairs difficult, machine translation, landscape of extremely

Comments:

Click to view abstract

Abstract:The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages -- particularly extinct and non-Latin indigenous languages -- suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.

28. 【2603.25203】Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Link: https://arxiv.org/abs/2603.25203

Authors: Ruichao Yang,Wei Gao,Xiaobin Zhu,Jing Ma,Hongzhan Lin,Ziyang Luo,Bo-Wen Zhang,Xu-Cheng Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: evades traditional detectors, opaque black boxes, Multimodal misinformation poses, Probabilistic Concept Graph, traditional detectors

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

29. 【2603.25201】SafeMath: Inference-time Safety improves Math Accuracy

Link: https://arxiv.org/abs/2603.25201

Authors: Sagnik Basu,Subhrajit Mitra,Aman Juneja,Somnath Banerjee,Rima Hazra,Animesh Mukherjee

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Recent research points, seemingly benign inputs, Recent research, benign inputs, research points

Comments: Submitted in ARR March 2026

Click to view abstract

Abstract:Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at this https URL.

30. 【2603.25196】A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Link: https://arxiv.org/abs/2603.25196

Authors: Andong Tan,Shuyu Dai,Jinglu Wang,Fengtao Zhou,Yan Lu,Xi Wang,Yingcong Chen,Can Yang,Shujie Liu,Hao Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords:

Comments:

Click to view abstract

None

31. 【2603.25189】A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Link: https://arxiv.org/abs/2603.25189

Authors: Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Categories: Computation and Language (cs.CL)

Keywords: Recent research, NLP has identified, identified data scarcity, dialectal NLP, data

Comments:

Click to view abstract

Abstract:Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

32. 【2603.25187】Probing the Lack of Stable Internal Beliefs in LLMs

Link: https://arxiv.org/abs/2603.25187

Authors: Yifan Luo,Kangping Xu,Yanzhen Lu,Yang Yuan,Andrew Chi-Chih Yao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, require consistent behavioral, consistent behavioral tendencies, Persona-driven large language, human-like personality traits

Comments: Accepted by NeurIPS 2025 Workshop Mexico City PersonaNLP

Click to view abstract

Abstract:Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.

33. 【2603.25183】Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Link: https://arxiv.org/abs/2603.25183

Authors: Ying Li,Xinglin Lyu,Junhui Li,Jinlong Yang,Hengchao Shang,Min Zhang,Shimin Tao,Daimeng Wei

Categories: Computation and Language (cs.CL)

Keywords: consistently outperform sentence-level, leverages document-level information, leverages document-level, beneficial across sentences, consistently outperform

Comments:

Click to view abstract

Abstract:Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model's ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.

34. 【2603.25178】Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Link: https://arxiv.org/abs/2603.25178

Authors: Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: holds significant potential, generation holds significant, cross-linguistic applications, holds significant, significant potential

Comments: 11 pages, 7 figures

Click to view abstract

Abstract:Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at this https URL

35. 【2603.25176】Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Link: https://arxiv.org/abs/2603.25176

Authors: Hieu Xuan Le,Benjamin Goh,Quy Anh Tang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Language Model, Large Language, risk to Large, critical security risk

Comments: 16 pages, 3 figures

Click to view abstract

Abstract:Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

36. 【2603.25169】To Write or to Automate Linguistic Prompts, That Is the Question

Link: https://arxiv.org/abs/2603.25169

Authors: Marina Sánchez-Torrón,Daria Akselrod,Jason Rauchwerk

Categories: Computation and Language (cs.CL)

Keywords: LLM performance, tasks remains unexplored, linguistic tasks remains, remains unexplored, performance is highly

Comments: 10 pages, to be submitted for EAMT 2026

Click to view abstract

Abstract:LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.

37. 【2603.25150】Goodness-of-pronunciation without phoneme time alignment

Link: https://arxiv.org/abs/2603.25150

Authors: Jeremy H. M. Wong,Nancy F. Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, speech evaluation, computes time boundaries

Comments:

Click to view abstract

Abstract:In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.

38. 【2603.25112】Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Link: https://arxiv.org/abs/2603.25112

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Signal Detection Theory, Brier score, Standard evaluation, LLM confidence relies, relies on calibration

Comments: 12 pages, 3 figures, 7 tables. Pre-registered; code and data at [this https URL](https://anonymous.4open.science/r/sdt_calibration)

Click to view abstract

Abstract:Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
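For readers unfamiliar with the quantities named in this abstract, here is a minimal sketch of the textbook definitions, not the paper's code. It assumes the standard yes/no signal detection formula for d' (the z-transformed hit rate minus the z-transformed false-alarm rate), the M-ratio as meta-d' divided by d', and the Brier score as the mean squared error between confidence and correctness. Estimating meta-d' itself requires model fitting and is not shown; the function names are hypothetical.

```python
from statistics import NormalDist
from typing import Sequence


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def m_ratio(meta_d_prime: float, d_prime_value: float) -> float:
    """Metacognitive efficiency: meta-d' expressed as a fraction of d'."""
    if d_prime_value == 0:
        raise ValueError("d' is zero, so the ratio is undefined")
    return meta_d_prime / d_prime_value


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between stated confidence and 0/1 correctness."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)
```

An M-ratio near 1 would mean confidence ratings use nearly all the information available to the Type-1 decision, which is the sense in which a model can have a high d' but a low M-ratio.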

39. 【2603.25105】OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Link: https://arxiv.org/abs/2603.25105

Authors: Suraj Racha,Prashant Harish Joshi,Utkarsh Maurya,Nitin Yadav,Mridul Sharma,Ananya Kunisetty,Saranya Darisipudi,Nirmal Punjabi,Ganesh Ramakrishnan

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, poses specific challenges, specifically mental health

Comments: 9 pages, 3 figures, 5 tables

Click to view abstract

Abstract:Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: the lack of high-quality, interpretable, and knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning with up to 80% win rate.

40. 【2603.25052】Closing the Confidence-Faithfulness Gap in Large Language Models

Link: https://arxiv.org/abs/2603.25052

Authors: Miranda Muqing Miao,Lyle Ungar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remain poorly understood, Large language models, geometric relationship governing, behavior remain poorly, Large language

Comments:

Click to view abstract

Abstract:Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.

41. 【2603.25051】Approaches to Analysing Historical Newspapers Using LLMs

Link: https://arxiv.org/abs/2603.25051

Authors: Filip Dobranić,Tina Munda,Oliver Pejić,Vojko Gorjanc,Uroš Šmajdek,David Bordon,Jakob Lenardič,Tjaša Konovšek,Kristina Pahor de Maiti Tekavčič,Ciril Bohak,Darja Fišer

Categories: Computation and Language (cs.CL)

Keywords: based aspect-level sentiment, Slovenski narod, large language model, combining topic modelling, Slovene historical newspapers

Comments:

Abstract:This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

42. 【2603.25040】Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Link: https://arxiv.org/abs/2603.25040

Authors: Yicheng Zou,Dongsheng Zhu,Lin Zhu,Tong Zhu,Yunhua Zhou,Peiheng Zhou,Xinyu Zhou,Dongzhan Zhou,Zhiwang Zhou,Yuhao Zhou,Bowen Zhou,Zhanping Zhong,Zhijie Zhong,Haiteng Zhao,Penghao Zhao,Xiaomeng Zhao,Zhiyuan Zhao,Yechen Zhang,Jin Zhang,Wenwei Zhang,Hongjie Zhang,Zhuo Zhang,Wenlong Zhang,Bo Zhang,Chao Zhang,Chen Zhang,Yuhang Zang,Fei Yuan,Jiakang Yuan,Jiashuo Yu,Jinhui Yin,Haochen Ye,Qian Yao,Bowen Yang,Danni Yang,Kaichen Yang,Ziang Yan,Jun Xu,Yicheng Xu,Wanghan Xu,Xuenan Xu,Chao Xu,Ruiliang Xu,Shuhao Xing,Long Xing,Xinchen Xie,Ling-I Wu,Zijian Wu,Zhenyu Wu,Lijun Wu,Yue Wu,Jianyu Wu,Wen Wu,Fan Wu,Xilin Wei,Qi Wei,Bingli Wang,Rui Wang,Ziyi Wang,Zun Wang,Yi Wang,Haomin Wang,Yizhou Wang,Lintao Wang,Yiheng Wang,Longjiang Wang,Bin Wang,Jian Tong,Zhongbo Tian,Huanze Tang,Chen Tang,Shixiang Tang,Yu Sun,Qiushi Sun,Xuerui Su,Qisheng Su,Chenlin Su,Demin Song,Jin Shi,Fukai Shang,Yuchen Ren,Pengli Ren,Xiaoye Qu,Yuan Qu,Jiantao Qiu,Yu Qiao,Runyu Peng,Tianshuo Peng,Jiahui Peng,Qizhi Pei,Zhuoshi Pan,Linke Ouyang,Wenchang Ning,Yichuan Ma,Zerun Ma,Ningsheng Ma,Runyuan Ma,Chengqi Lyu,Haijun Lv,Han Lv

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal foundation model, scientific multimodal foundation, multimodal foundation, foundation model, scientific multimodal

Comments:

Abstract:We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.

43. 【2603.25015】Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models

Link: https://arxiv.org/abs/2603.25015

Authors: Tony Mason

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: cooperate in English, English compete, opposite interaction topology, semantic content, opposite interaction

Comments:

Abstract:System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: "NEVER do X" is an exercise of authority whose force is language-dependent, while "X: disabled" is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.

44. 【2603.24981】Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

Link: https://arxiv.org/abs/2603.24981

Authors: Xiaowei Zhu,Yubing Ren,Fang Fang,Shi Wang,Yanan Cao,Li Guo

Categories: Computation and Language (cs.CL)

Keywords: raising societal risks, large language models, authorship ambiguity, raising societal, misinformation dissemination

Comments:

Abstract:The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.

45. 【2603.24979】LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Link: https://arxiv.org/abs/2603.24979

Authors: Yuhang Zhou,Zhuokai Zhao,Ke Li,Spilios Evmorfos,Gökalp Demirci,Mingyi Wang,Qiao Liu,Qifei Wang,Serena Li,Weiwei Li,Tingting Wang,Mingze Gao,Gedi Zhou,Abhishek Kumar,Xiangjun Fan,Lizhu Zhang,Jiayi Liu

Categories: Computation and Language (cs.CL)

Keywords: directly affecting model, machine learning systems, large-scale industrial machine, industrial machine learning, affecting model accuracy

Comments: 11 pages, 2 tables

Abstract:Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

46. 【2603.24961】Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Link: https://arxiv.org/abs/2603.24961

Authors: Dingjie Song,Tianlong Xu,Yi-Fan Zhang,Hang Li,Zhiling Yan,Xing Fan,Haoyang Li,Lichao Sun,Qingsong Wen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: presents unique challenges, unique challenges due, Assessing student handwritten, personalized educational feedback, varied problem-solving approaches

Comments: Accepted by the 27th International Conference on Artificial Intelligence in Education (AIED'26)

Abstract:Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

47. 【2603.24955】Toward domain-specific machine translation and quality estimation systems

Link: https://arxiv.org/abs/2603.24955

Authors: Javad Pourmostafa Roshan Sharami

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Machine Translation, Quality Estimation, translation quality, Chapter, Machine

Comments: PhD Dissertation

Abstract:Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.

48. 【2603.24943】FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Link: https://arxiv.org/abs/2603.24943

Authors: Jie Zhu,Yimin Tian,Boyang Li,Kehao Wu,Zhongzhi Liang,Junhui Li,Xianyin Zhang,Lifan Guo,Feng Chen,Yong Liu,Chi Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: evaluating large language, model context protocols, large language models, real-world financial problems, solving real-world financial

Comments: Accepted by ICASSP 2026

Abstract:This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

49. 【2603.24941】Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.24941

Authors: Peiju Liu,Jinming Liu,Xipeng Qiu,Xuanjing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: significant inference latency, inference latency due, processing dense visual, dense visual tokens, models excel

Comments: 10 pages, 7 figures, preprint

Abstract:Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.

50. 【2603.24929】LogitScope: A Framework for Analyzing LLM Uncertainty Through Information Metrics

Link: https://arxiv.org/abs/2603.24929

Authors: Farhan Ahmed,Yuya Jeremy Ong,Chad DeLuca

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)

Keywords: Understanding and quantifying, large language model, outputs is critical, reliable deployment, large language

Comments:

Abstract:Understanding and quantifying uncertainty in large language model (LLM) outputs is critical for reliable deployment. However, traditional evaluation approaches provide limited insight into model confidence at individual token positions during generation. To address this issue, we introduce LogitScope, a lightweight framework for analyzing LLM uncertainty through token-level information metrics computed from probability distributions. By measuring metrics such as entropy and varentropy at each generation step, LogitScope reveals patterns in model confidence, identifies potential hallucinations, and exposes decision points where models exhibit high uncertainty, all without requiring labeled data or semantic interpretation. We demonstrate LogitScope's utility across diverse applications including uncertainty quantification, model behavior analysis, and production monitoring. The framework is model-agnostic, computationally efficient through lazy evaluation, and compatible with any HuggingFace model, enabling both researchers and practitioners to inspect LLM behavior during inference.
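
The two metrics named in the abstract above can be computed directly from a next-token distribution. The sketch below is not LogitScope's API: the helper names and the toy logits are assumptions for illustration. Entropy is the expected surprisal, and varentropy is the variance of the surprisal.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_and_varentropy(probs):
    """Return (entropy, varentropy) in nats for one next-token distribution."""
    # Surprisal -log p for every token with non-zero probability.
    support = [(p, -math.log(p)) for p in probs if p > 0.0]
    # Entropy is the expected surprisal.
    entropy = sum(p * s for p, s in support)
    # Varentropy is the variance of the surprisal around the entropy.
    varentropy = sum(p * (s - entropy) ** 2 for p, s in support)
    return entropy, varentropy

# Hypothetical logits for two generation steps over a 4-token vocabulary.
steps = {
    "confident": [8.0, 0.0, 0.0, 0.0],
    "uncertain": [1.0, 1.0, 1.0, 1.0],
}
for name, logits in steps.items():
    h, v = entropy_and_varentropy(softmax(logits))
    print(f"{name}: entropy={h:.3f} nats, varentropy={v:.3f}")
```

A uniform distribution has maximal entropy but zero varentropy, since every token is equally surprising. Tracking both metrics per step separates the two cases "many equally plausible options" and "one likely option plus a long tail".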

51. 【2603.24925】GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.24925

Authors: Ruizhong Miao,Yuying Wang,Rongguang Wang,Chenyang Li,Tao Sheng,Sujith Ravi,Dan Roth

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: retrieval-augmented generation, Semantic search, insufficient for complex, complex information, relevant evidence

Comments:

Abstract:Semantic search in retrieval-augmented generation (RAG) systems is often insufficient for complex information needs, particularly when relevant evidence is scattered across multiple sources. Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries. However, these methods do not fully leverage the organizational structure of the data and instead rely on iterative exploration, which can lead to inefficient retrieval. Another class of approaches employs knowledge graphs to model non-semantic relationships through graph edges. Although effective in capturing richer proximities, such methods incur significant maintenance costs and are often incompatible with the vector stores used in most production systems. To address these limitations, we propose GraphER, a graph-based enrichment and reranking method that captures multiple forms of proximity beyond semantic similarity. GraphER independently enriches data objects during offline indexing and performs graph-based reranking over candidate objects at query time. This design does not require a knowledge graph, allowing GraphER to integrate seamlessly with standard vector stores. In addition, GraphER is retriever-agnostic and introduces negligible latency overhead. Experiments on multiple retrieval benchmarks demonstrate the effectiveness of the proposed approach.

52. 【2603.24917】Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

Link: https://arxiv.org/abs/2603.24917

Authors: A. Feder Cooper,Mark A. Lemley,Christopher De Sa,Lea Duesterwald,Allison Casasola,Jamie Hayes,Katherine Lee,Daniel E. Ho,Percy Liang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Recent work shows, Recent work, standard greedy-decoding extraction, near-verbatim extraction risk, extraction risk varies

Comments:

Abstract:Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction -- computing the probability of generating a target suffix given a prefix under a decoding scheme -- addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
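
To make the quantities in the abstract above concrete: the probability of generating a suffix is the product of its per-token conditional probabilities, and summing that probability over any subset of distinct near-verbatim suffixes gives a lower bound on the total near-verbatim mass. The sketch below illustrates only this arithmetic. It is not the paper's decoding-constrained beam search, and the per-token probabilities are hypothetical.

```python
import math

def suffix_probability(token_probs):
    """Probability of one suffix = product of per-token conditional probabilities."""
    if any(p <= 0.0 for p in token_probs):
        return 0.0
    # Sum logs instead of multiplying to avoid underflow on long suffixes.
    return math.exp(sum(math.log(p) for p in token_probs))

def near_verbatim_lower_bound(found_suffixes):
    """Sum the mass of the distinct near-verbatim suffixes found so far.

    Distinct suffixes are mutually exclusive outcomes, so the mass of any
    subset cannot exceed the mass of the full near-verbatim set.
    """
    return sum(suffix_probability(tp) for tp in found_suffixes)

# Hypothetical per-token probabilities under some decoding scheme.
verbatim = [0.9, 0.8, 0.5]          # the exact target suffix
near_variants = [
    [0.9, 0.8, 0.3],                # differs from the target in the last token
    [0.9, 0.1, 0.6],                # differs from the target in the middle token
]

p_verbatim = suffix_probability(verbatim)
p_near = near_verbatim_lower_bound([verbatim] + near_variants)
print("verbatim extraction probability:", p_verbatim)
print("lower bound on near-verbatim mass:", p_near)
```

The bound is deterministic because it adds up exact probabilities of enumerated suffixes rather than estimating a frequency from samples.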

53. 【2603.24896】LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis

Link: https://arxiv.org/abs/2603.24896

Authors: Baraa Hikal,Jonas Becker,Bela Gipp

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Aspect-Based Sentiment Analysis, Dimensional Aspect-Based Sentiment, Sentiment Analysis, Dimensional Aspect-Based, paper describes LogSigma

Comments:

Abstract:This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles (from 0.66x for German to 2.18x for English), demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
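
One common way to write a loss with learned homoscedastic uncertainty is to weight each task loss by exp(-s) and add s as a regularizer, where s is that task's learned log-variance. The sketch below assumes this form; the abstract above does not state the exact formula, so the function names and numbers here are illustrative only.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with log-variances s_i = log(sigma_i^2).

    Assumed form: sum_i exp(-s_i) * L_i + s_i. The exp(-s_i) factor
    down-weights noisier tasks; the + s_i term stops s_i from growing
    without bound. The paper's exact formulation may differ.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

def effective_weights(log_vars):
    """Per-task multipliers implied by the log-variances."""
    return [math.exp(-s) for s in log_vars]

# Hypothetical regression losses for Valence and Arousal on one batch.
losses = [0.8, 1.6]

# With both log-variances at 0 the tasks are weighted equally.
print(uncertainty_weighted_loss(losses, [0.0, 0.0]))

# Raising the Arousal log-variance shrinks its weight relative to Valence.
print(effective_weights([0.0, 0.7]))
```

In training, the log-variances would be optimized jointly with the model parameters, which is how the balance between the two regression objectives can end up differing per language.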

54. 【2603.24866】How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

Link: https://arxiv.org/abs/2603.24866

Authors: Luyu Yang,Yutong Dai,An Yan,Viraj Prabhu,Ran Xu,Zeyuan Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: governed by rigorous, procedural constraints, physical world, physical, physical generative reasoning

Comments:

Abstract:The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at this https URL

55. 【2603.24857】AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Link: https://arxiv.org/abs/2603.24857

Authors: Zhenyi Wang,Siyu Luan

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: machine learning, systems expand, scale and functionality, increasingly complex

Comments: Published at Transactions on Machine Learning Research (TMLR)

Abstract:As machine learning (ML) systems expand in both scale and functionality, the security landscape has become increasingly complex, with a proliferation of attacks and defenses. However, existing studies largely treat these threats in isolation, lacking a coherent framework to expose their shared principles and interdependencies. This fragmented view hinders systematic understanding and limits the design of comprehensive defenses. Crucially, the two foundational assets of ML -- data and models -- are no longer independent; vulnerabilities in one directly compromise the other. The absence of a holistic framework leaves open questions about how these bidirectional risks propagate across the ML pipeline. To address this critical gap, we propose a unified closed-loop threat taxonomy that explicitly frames model-data interactions along four directional axes. Our framework offers a principled lens for analyzing and defending foundation models. The resulting four classes of security threats represent distinct but interrelated categories of attacks: (1) Data→Data (D→D): including data decryption attacks and watermark removal attacks; (2) Data→Model (D→M): including poisoning, harmful fine-tuning attacks, and jailbreak attacks; (3) Model→Data (M→D): including model inversion, membership inference attacks, and training data extraction attacks; (4) Model→Model (M→M): including model extraction attacks. Our unified framework elucidates the underlying connections among these security threats and establishes a foundation for developing scalable, transferable, and cross-modal security strategies, particularly within the landscape of foundation models.

56. 【2603.24844】Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Link: https://arxiv.org/abs/2603.24844

Authors: Isha Puri,Mehul Damani,Idan Shenfeld,Marzyeh Ghassemi,Jacob Andreas,Yoon Kim

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: implicitly encodes, multiple, answers, generate multiple, generate

Comments:

Abstract:Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at this https URL.

57. 【2603.24840】Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Link: https://arxiv.org/abs/2603.24840

Authors: Haobo Xu,Sirui Chen,Ruizhong Qiu,Yuchen Yan,Chen Luo,Monica Cheng,Jingrui He,Hanghang Tong

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, capabilities of Large, Verifiable Rewards, Reinforcement Learning

Comments: 17 pages, 4 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones to be more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at this https URL.
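
The "weak learning signal" point in the abstract above can be seen with a group-relative advantage of the kind used in GRPO-style training, where each rollout's reward is normalized by the mean and standard deviation of its group. This is a minimal sketch with hypothetical 0/1 rewards, not the arrol implementation; the `eps` constant is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout reward by its group's mean and std deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifiable rewards (1 = correct, 0 = incorrect) for one prompt.
all_correct = [1.0, 1.0, 1.0, 1.0]
balanced = [1.0, 0.0, 1.0, 0.0]

# Every advantage is zero: this group contributes no gradient signal.
print(group_relative_advantages(all_correct))

# Correct and incorrect rollouts get opposite-signed advantages.
print(group_relative_advantages(balanced))
```

A group with a mix of correct and incorrect rollouts gives the largest spread of advantages, which is why keeping surviving rollouts correctness-balanced can strengthen the signal.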

58. 【2603.24826】Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Link: https://arxiv.org/abs/2603.24826

Authors: Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini

Subjects: Computation and Language (cs.CL)

Keywords: focus on English, improving language model, language model pretraining, Synthetic data generation, Portuguese continued pretraining

Comments:

Abstract:Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

59. 【2603.24797】Enhancing Structured Meaning Representations with Aspect Classification

Link: https://arxiv.org/abs/2603.24797

Authors: Claire Benét Post,Paul Bontempo,August Milliken,Alvin Po-Chun Chen,Nicholas Derby,Saksham Khatwani,Sumeyye Nabieva,Karthik Sairam,Alexis Palmer

Subjects: Computation and Language (cs.CL)

Keywords: internal temporal structure, Uniform Meaning Representations, meaning representation frameworks, meaning representation, Abstract Meaning Representation

Comments: 15 pages, 3 figures, 8 tables

Abstract:To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.

60. 【2603.24772】Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

Link: https://arxiv.org/abs/2603.24772

Authors: Mohammed Nowshad Ruhani Chowdhury,Mohammed Nowaz Rabbani Chowdhury,Sakari Lukkarinen

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: patient safety, continuity of care, Finnish, large language model, fine-tuning

Comments: 9 pages, 3 figures, 2 tables

Abstract:Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study aims to investigate the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.

61. 【2603.24767】Fine-Tuning A Large Language Model for Systematic Review Screening

Link: https://arxiv.org/abs/2603.24767

Authors: Kweku Yamoah,Noah Schroeder,Emmanuel Dorley,Neha Rani,Caleb Schutz

Subjects: Computation and Language (cs.CL)

Keywords: energy to complete, considerable amounts, time and energy, part due, extensive number

Comments:

Abstract:Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, a 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.

62. 【2603.24755】SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Link: https://arxiv.org/abs/2603.24755

Authors: Gabriel Orlanski,Devjeet Roy,Alexander Yun,Changho Shin,Alex Gu,Albert Ge,Dyah Adila,Frederic Sala,Aws Albarghouthi

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: overwhelmingly evaluate single-shot, evaluate single-shot solutions, agentic coding benchmarks, coding benchmarks overwhelmingly, benchmarks overwhelmingly evaluate

Comments: Code and Leaderboards are located at [this https URL](https://www.scbench.ai)

Abstract:Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.

63. 【2603.24709】Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Link: https://arxiv.org/abs/2603.24709

Authors: Cheng Jiayang,Xin Liu,Zhihan Zhang,Haoyang Wen,Zixuan Zhang,Qingyu Yin,Shiyang Li,Priyanka Nigam,Bing Yin,Chao Zhang,Yangqiu Song

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: propagating intermediate outputs, invoke multiple dependent, multiple dependent APIs, remains challenging, intermediate outputs

Comments: Under Review

Abstract:Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.

64. 【2603.24652】Demystifying When Pruning Works via Representation Hierarchies

Link: https://arxiv.org/abs/2603.24652

Authors: Shwai He,Guoheng Sun,Haichao Zhang,Yun Fu,Ang Li

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: parameters or architectures, preserving performance, removes less important, important parameters, expected to improve

Comments: 26 pages, 21 figures, Table 3

Abstract:Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at this https URL

65. 【2603.24651】When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews

Link: https://arxiv.org/abs/2603.24651

Authors: Hasindri Watawana,Sergio Burdisso,Diego A. Moreno-Galván,Fernando Sánchez-Vega,A. Pastor López-Monroy,Petr Motlicek,Esaú Villatoro-Tello

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Automatic depression detection, Automatic depression, depression detection, detection from doctor-patient, doctor-patient conversations

Comments: Accepted to LREC 2026 Conference

Abstract:Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets: ANDROIDS, DAIC-WOZ, E-DAIC and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants' language.

66. 【2402.05122】History of generative Artificial Intelligence (AI) chatbots: past, present, and future development

Link: https://arxiv.org/abs/2402.05122

Authors: Md. Al-Amin,Mohammad Shazed Ali,Abdus Salam,Arif Khan,Ashraf Ali,Ahsan Ullah,Md Nur Alam,Shamsul Kabir Chowdhury

Subjects: General Literature (cs.GL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: in-depth comprehensive review, initial basic systems, basic systems relying, conversational bots powered, today advanced conversational

Comments:

Abstract:This research provides an in-depth comprehensive review of the progress of chatbot technology over time, from the initial basic systems relying on rules to today's advanced conversational bots powered by artificial intelligence. Spanning many decades, the paper explores the major milestones, innovations, and paradigm shifts that have driven the evolution of chatbots. Looking back at the very basic statistical model in 1906 via the early chatbots, such as ELIZA and ALICE in the 1960s and 1970s, the study traces key innovations leading to today's advanced conversational agents, such as ChatGPT and Google Bard. The study synthesizes insights from academic literature and industry sources to highlight crucial milestones, including the introduction of Turing tests, influential projects such as CALO, and recent transformer-based models. Tracing the path forward, the paper highlights how natural language processing and machine learning have been integrated into modern chatbots for more sophisticated capabilities. This chronological survey of the chatbot landscape provides a holistic reference to understand the technological and historical factors propelling conversational AI. By synthesizing learnings from this historical analysis, the research offers important context about the developmental trajectory of chatbots and their immense future potential across various fields of application, which could offer takeaways for the research community and stakeholders.

67. 【2603.24596】X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

Link: https://arxiv.org/abs/2603.24596

Authors: Di Cao,Dongjie Fu,Hai Yu,Siqi Zheng,Xu Tan,Tao Jin

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: speech Large Language, Large Language Models, Large Language, cascaded dialogue systems, significant performance degradation

Comments: 5 pages

Abstract:While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher's capabilities into student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

Information Retrieval

1. 【2603.25737】Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Link: https://arxiv.org/abs/2603.25737

Authors: Yuxing Lu,Xukai Zhao,Wei Wu,Jinzhuo Wang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: retrieval-augmented generation, system is typically, irrelevant content, knowledge base, typically assembled

Comments: 15 pages

2. 【2603.25500】Unveiling the Resilience of LLM-Enhanced Search Engines against Black-Hat SEO Manipulation

Link: https://arxiv.org/abs/2603.25500

Authors: Pei Chen,Geng Hong,Xinyi Wu,Mengying Wu,Zixuan Zhu,Mingxuan Liu,Baojun Liu,Mi Zhang,Min Yang

Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Large Language Model-enhanced, Language Model-enhanced Search, Model-enhanced Search Engines, Large Language, Language Model-enhanced

Comments: Accepted at The ACM Web Conference 2026 (WWW 2026)

Abstract:The emergence of Large Language Model-enhanced Search Engines (LLMSEs) has revolutionized information retrieval by integrating web-scale search capabilities with AI-powered summarization. While these systems demonstrate improved efficiency over traditional search engines, their security implications against well-established black-hat Search Engine Optimization (SEO) attacks remain unexplored. In this paper, we present the first systematic study of SEO attacks targeting LLMSEs. Specifically, we examine ten representative LLMSE products (e.g., ChatGPT, Gemini) and construct SEO-Bench, a benchmark comprising 1,000 real-world black-hat SEO websites, to evaluate both open- and closed-source LLMSEs. Our measurements show that LLMSEs mitigate over 99.78% of traditional SEO attacks, with the phase of retrieval serving as the primary filter, intercepting the vast majority of malicious queries. We further propose and evaluate seven LLMSEO attack strategies, demonstrating that off-the-shelf LLMSEs are vulnerable to LLMSEO attacks, i.e., rewritten-query stuffing and segmented texts double the manipulation rate compared to the baseline. This work offers the first in-depth security analysis of the LLMSE ecosystem, providing practical insights for building more resilient AI-driven search systems. We have responsibly reported the identified issues to major vendors.

3. 【2603.25374】Supercharging Federated Intelligence Retrieval

Link: https://arxiv.org/abs/2603.25374

Authors: Dimitris Stripelis,Patrick Foley,Mohammad Naseri,William Lindskog-Münzing,Chong Shen Ng,Daniel Janes Beutel,Nicholas D. Lane

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: typically assumes centralized, assumes centralized access, RAG typically assumes, private data silos, access to documents

Comments: 6 pages, 1 figure, 2 tables

4. 【2603.25333】Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Link: https://arxiv.org/abs/2603.25333

Authors: Paulo Roberto de Moura Júnior,Jean Lelong,Annabelle Blangero

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Retrieval-Augmented Generation, segmented into smaller, indexing and retrieval, effectiveness of Retrieval-Augmented, highly dependent

Comments: Accepted at LREC 2026. 10 pages, 4 figures. Code: [this https URL](https://github.com/ekimetrics/adaptive-chunking)

5. 【2603.25248】ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval

Link: https://arxiv.org/abs/2603.25248

Authors: Raj Nath Patel,Sourav Dutta

Subjects: Information Retrieval (cs.IR)

Keywords: Neural Information Retrieval, knowledge extraction tasks, Information Retrieval systems, pre-trained language models, language models form

Comments: 5 pages

Abstract:Vector embeddings from pre-trained language models form a core component in Neural Information Retrieval systems across a multitude of knowledge extraction tasks. The paradigm of late interaction, introduced in ColBERT, demonstrates high accuracy along with runtime efficiency. However, the current formulation fails to take into account the attention weights of query and document terms, which intuitively capture the "importance" of similarities between them, that might lead to a better understanding of relevance between the queries and documents. This work proposes ColBERT-Att, to explicitly integrate attention mechanism into the late interaction framework for enhanced retrieval performance. Empirical evaluation of ColBERT-Att depicts improvements in recall accuracy on MS-MARCO as well as on a wide range of BEIR and LoTTE benchmark datasets.
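For readers unfamiliar with late interaction, the standard ColBERT relevance score sums, over query token embeddings, the maximum similarity to any document token embedding (often called MaxSim). The sketch below shows only that baseline scoring in plain Python with toy vectors; it does not attempt to reproduce the attention weighting proposed in ColBERT-Att, and the helper names `dot` and `maxsim_score` are chosen here for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its
    best-matching document token embedding, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)
```

In this unweighted form every query token contributes equally to the score, which is the gap the abstract describes when it says attention weights are not taken into account.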

6. 【2603.25152】UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering, and Dual-Channel Fusion for Robust Multi-Hop Reasoning

Link: https://arxiv.org/abs/2603.25152

Authors: Jie Wang,Honghua Huang,Xi Ge,Jianhui Su,Wen Liu,Shiguo Lian

Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: systems face significant, face significant challenges, Retrieval-Augmented Generation, systems face, complex reasoning

Comments:

Abstract:Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in cross-industry adaptability, community report integrity, and retrieval performance. This paper proposes UniAI-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on MultiHopRAG benchmark show that UniAI-GraphRAG outperforms mainstream open source solutions (this http URL) in comprehensive F1 scores, particularly in inference and temporal queries. The code is available at this https URL.

7. 【2603.25126】MCLMR: A Model-Agnostic Causal Learning Framework for Multi-Behavior Recommendation

Link: https://arxiv.org/abs/2603.25126

Authors: Ranxu Zhang,Junjie Meng,Ying Sun,Ziqi Xu,Bing Yin,Hao Li,Yanyong Zhang,Chao Wang

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: traditional single-behavior approaches, user interaction types, leverages multiple user, multiple user interaction, enrich preference modeling

Comments: Accepted by WWW 2026

Abstract:Multi-Behavior Recommendation (MBR) leverages multiple user interaction types (e.g., views, clicks, purchases) to enrich preference modeling and alleviate data sparsity issues in traditional single-behavior approaches. However, existing MBR methods face fundamental challenges: they lack principled frameworks to model complex confounding effects from user behavioral habits and item multi-behavior distributions, struggle with effective aggregation of heterogeneous auxiliary behaviors, and fail to align behavioral representations across semantic gaps while accounting for bias distortions. To address these limitations, we propose MCLMR, a novel model-agnostic causal learning framework that can be seamlessly integrated into various MBR architectures. MCLMR first constructs a causal graph to model confounding effects and performs interventions for unbiased preference estimation. Under this causal framework, it employs an Adaptive Aggregation module based on Mixture-of-Experts to dynamically fuse auxiliary behavior information and a Bias-aware Contrastive Learning module to align cross-behavior representations in a bias-aware manner. Extensive experiments on three real-world datasets demonstrate that MCLMR achieves significant performance improvements across various baseline models, validating its effectiveness and generality. All data and code will be made publicly available. For anonymous review, our code is available at the following link: this https URL.

8. 【2603.25092】AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.25092

Authors: Zhihui Yao,Hengran Zhang,Keping Bi

Subjects: Information Retrieval (cs.IR)

Keywords: Large Language Models, enhances Large Language, Language Models, Large Language, Retrieval-Augmented Generation

Comments: 11 pages, 4 figures. Submitted to ACL 2026

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge but remains vulnerable to low-authority sources that can propagate misinformation. We investigate whether LLMs can perceive information authority - a capability extending beyond semantic understanding. To address this, we introduce AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception comprising three datasets: DomainAuth (10K web domains with PageRank-based authority), EntityAuth (22K entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation). We evaluate five LLMs using three judging methods (PointJudge, PairJudge, ListJudge) across multiple output formats. Results show that ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness. Notably, incorporating webpage text consistently degrades judgment performance, suggesting authority is distinct from textual style. Downstream experiments on RAG demonstrate that authority-guided filtering largely improves answer accuracy, validating the practical importance of authority perception for reliable knowledge retrieval. Code and benchmark are available at: this https URL.

9. 【2603.25027】Hyena Operator for Fast Sequential Recommendation

Link: https://arxiv.org/abs/2603.25027

Authors: Jiahao Liu,Lin Li,Zhiyuan Li,Kaixi Hu,Kaize Shi,Jingling Yuan

Subjects: Information Retrieval (cs.IR)

Keywords: incur quadratic complexity, histories prohibitively expensive, user histories prohibitively, achieve strong accuracy, achieve strong

Comments: 11 pages, 5 figures, accepted by ACM Web Conference 2026 (WWW '26)

Abstract:Sequential recommendation models, particularly those based on attention, achieve strong accuracy but incur quadratic complexity, making long user histories prohibitively expensive. Sub-quadratic operators such as Hyena provide efficient alternatives in language modeling, but their potential in recommendation remains underexplored. We argue that Hyena faces challenges in recommendation due to limited representation capacity on sparse, long user sequences. To address these challenges, we propose HyenaRec, a novel sequential recommender that integrates polynomial-based kernel parameterization with gated convolutions. Specifically, we design convolutional kernels using Legendre orthogonal polynomials, which provides a smooth and compact basis for modeling long-term temporal dependencies. A complementary gating mechanism captures fine-grained short-term behavioral bursts, yielding a hybrid architecture that balances global temporal evolution with localized user interests under sparse feedback. This construction enhances expressiveness while scaling linearly with sequence length. Extensive experiments on multiple real-world datasets demonstrate that HyenaRec consistently outperforms Attention-, Recurrent-, and other baselines in ranking accuracy. Moreover, it trains significantly faster (up to 6x speedup), with particularly pronounced advantages on long-sequence scenarios where efficiency is maintained without sacrificing accuracy. These results highlight polynomial-based kernel parameterization as a principled and scalable alternative to attention for sequential recommendation.

10. 【2603.25011】Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval

Link: https://arxiv.org/abs/2603.25011

Authors: Thong Nguyen,Cosimo Rulli,Franco Maria Nardini,Rossano Venturini,Andrew Yates

Subjects: Information Retrieval (cs.IR)

Keywords: Learned Sparse Retrieval, Language Modeling, project latent hidden, latent hidden states, Learned Sparse

Comments:

Abstract:State-of-the-art Learned Sparse Retrieval (LSR) models, such as Splade, typically employ a Language Modeling (LM) head to project latent hidden states into a lexically-anchored logit matrix. This intermediate matrix is subsequently transformed into a sparse lexical representation through element-wise operations (ReLU, Log1P) and max-pooling over the sequence dimension. Despite its effectiveness, the LM head creates a massive memory bottleneck due to the sheer size of the vocabulary (V), which can range from 30,000 to over 250,000 tokens in recent models. Materializing this matrix creates a significant memory bottleneck, limiting model scaling. The resulting I/O overhead between operators further throttles throughput and runtime performance. In this paper, we propose Sparton, a fast memory-efficient Triton kernel tailored for the LM head in LSR models. Sparton utilizes a fused approach that integrates the tiled matrix multiplication, ReLU, Log1P, and max-reduction into a single GPU kernel. By performing an early online reduction directly on raw logit tiles, Sparton avoids materializing the full logit matrix in memory. Our experiments demonstrate that the Sparton kernel, in isolation, achieves up to a 4.8x speedup and an order-of-magnitude reduction in peak memory usage compared to PyTorch baselines. Integrated into Splade (|V| ~ 30k), Sparton enables a 33% larger batch size and 14% faster training with no effectiveness loss. On a multilingual backbone (|V| ~ 250k), these gains jump to a 26x larger batch size and 2.5x faster training.
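The pooling described in this abstract can be sketched in a few lines. Because ReLU followed by Log1P is non-decreasing, the maximum over the sequence can be taken on raw logits first and the activation applied once afterward, so only a running per-vocabulary maximum needs to be kept. The plain-Python sketch below illustrates that equivalence on a tiny logit matrix; it is not the Sparton Triton kernel, it omits the matrix multiplication that produces the logits, and the function names and `tile` parameter are assumptions made here.

```python
import math


def sparse_rep_full(logits):
    """Reference path: apply log1p(relu(x)) to every logit, then
    max-pool over the sequence dimension (rows)."""
    vocab = len(logits[0])
    activated = [[math.log1p(max(x, 0.0)) for x in row] for row in logits]
    return [max(row[j] for row in activated) for j in range(vocab)]


def sparse_rep_streaming(logits, tile=2):
    """Streaming path: keep a running max of raw logits per vocabulary
    entry while visiting the sequence tile by tile, then apply the
    activation once at the end."""
    vocab = len(logits[0])
    running = [float("-inf")] * vocab
    for start in range(0, len(logits), tile):
        for row in logits[start:start + tile]:
            for j, x in enumerate(row):
                running[j] = max(running[j], x)
    return [math.log1p(max(x, 0.0)) for x in running]


# Toy logits: 3 sequence positions x 3 vocabulary entries.
logits = [[-1.0, 2.0, 0.5], [3.0, -0.5, 0.5], [0.0, 1.0, -2.0]]
full = sparse_rep_full(logits)
streamed = sparse_rep_streaming(logits)
```

The streaming path never holds the activated matrix, only one vector of vocabulary size, which mirrors the memory argument made in the abstract.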

11. 【2603.24975】Unbiased Multimodal Reranking for Long-Tail Short-Video Search

Link: https://arxiv.org/abs/2603.24975

Authors: Wenyi Xu,Feiran Zhu,Songyang Li,Renzhe Zhou,Chao Zhang,Chenglei Dai,Yuren Mao,Yunjun Gao,Yi Zhang

Categories: Information Retrieval (cs.IR)

Keywords: Kuaishou serving hundreds, Kuaishou serving, searches daily, Large Language Models, serving hundreds

Comments:

Click to view abstract

Abstract:With Kuaishou serving hundreds of millions of searches daily, the quality of short-video search is paramount. However, it suffers from a severe Matthew effect on long-tail queries: sparse user behavior data causes models to amplify low-quality content such as clickbait and shallow content. The recent advancements in Large Language Models (LLMs) offer a new paradigm, as their inherent world knowledge provides a powerful mechanism to assess content quality, agnostic to sparse user interactions. To this end, we propose an LLM-driven multimodal reranking framework, which estimates user experience without real user behavior. The approach involves a two-stage training process: the first stage uses multimodal evidence to construct high-quality annotations for supervised fine-tuning, while the second stage incorporates pairwise preference optimization to help the model learn partial orderings among candidates. At inference time, the resulting experience scores are used to promote high-quality but underexposed videos in reranking, and further guide page-level optimization through reinforcement learning. Experiments show that the proposed method achieves consistent improvements over strong baselines in offline metrics including AUC, NDCG@K, and human preference judgement. An online A/B test covering 15% of traffic further demonstrates gains in both user experience and consumption metrics, confirming the practical value of the approach in long-tail video search scenarios.

12. 【2603.24958】DIET: Learning to Distill Dataset Continually for Recommender Systems

Link: https://arxiv.org/abs/2603.24958

Authors: Jiaqing Zhang,Hao Wang,Mingjia Yin,Bo Chen,Qinglin Jia,Rui Zhou,Ruiming Tang,ChaoYi Ma,Enhong Chen

Categories: Information Retrieval (cs.IR)

Keywords: Modern deep recommender, Modern deep, streaming behavioral logs, continual learning paradigm, continuously growing streaming

Comments:

Click to view abstract

Abstract:Modern deep recommender models are trained under a continual learning paradigm, relying on massive and continuously growing streaming behavioral logs. In large-scale platforms, retraining models on full historical data for architecture comparison or iteration is prohibitively expensive, severely slowing down model development. This challenge calls for data-efficient approaches that can faithfully approximate full-data training behavior without repeatedly processing the entire evolving data stream. We formulate this problem as streaming dataset distillation for recommender systems and propose DIET, a unified framework that maintains a compact distilled dataset which evolves alongside streaming data while preserving training-critical signals. Unlike existing dataset distillation methods that construct a static distilled set, DIET models distilled data as an evolving training memory and updates it in a stage-wise manner to remain aligned with long-term training dynamics. DIET enables effective continual distillation through principled initialization from influential samples and selective updates guided by influence-aware memory addressing within a bi-level optimization framework. Experiments on large-scale recommendation benchmarks demonstrate that DIET compresses training data to as little as 1-2% of the original size while preserving performance trends consistent with full-data training, reducing model iteration cost by up to 60x. Moreover, the distilled datasets produced by DIET generalize well across different model architectures, highlighting streaming dataset distillation as a scalable and reusable data foundation for recommender system development.

13. 【2603.24925】GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.24925

Authors: Ruizhong Miao,Yuying Wang,Rongguang Wang,Chenyang Li,Tao Sheng,Sujith Ravi,Dan Roth

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: retrieval-augmented generation, Semantic search, insufficient for complex, complex information, relevant evidence

Comments:

Click to view abstract

14. 【2603.24765】Enhancing Online Support Group Formation Using Topic Modeling Techniques

Link: https://arxiv.org/abs/2603.24765

Authors: Pronob Kumar Barman,Tera L. Reynolds,James Foulds

Categories: Information Retrieval (cs.IR); Machine Learning (stat.ML)

Keywords: Online health communities, support group formation, fostering peer support, Dirichlet Multinomial Regression, Online health

Comments:

Click to view abstract

Abstract:Online health communities (OHCs) are vital for fostering peer support and improving health outcomes. Support groups within these platforms can provide more personalized and cohesive peer support, yet traditional support group formation methods face challenges related to scalability, static categorization, and insufficient personalization. To overcome these limitations, we propose two novel machine learning models for automated support group formation: the Group specific Dirichlet Multinomial Regression (gDMR) and the Group specific Structured Topic Model (gSTM). These models integrate user generated textual content, demographic profiles, and interaction data represented through node embeddings derived from user networks to systematically automate personalized, semantically coherent support group formation. We evaluate the models on a large scale dataset from this http URL, comprising over 2 million user posts. Both models substantially outperform baseline methods including LDA, DMR, and STM in predictive accuracy (held out log likelihood), semantic coherence (UMass metric), and internal group consistency. The gDMR model yields group covariates that facilitate practical implementation by leveraging relational patterns from network structures and demographic data. In contrast, gSTM emphasizes sparsity constraints to generate more distinct and thematically specific groups. Qualitative analysis further validates the alignment between model generated groups and manually coded themes, showing the practical relevance of the models in informing groups that address diverse health concerns such as chronic illness management, diagnostic uncertainty, and mental health. By reducing reliance on manual curation, these frameworks provide scalable solutions that enhance peer interactions within OHCs, with implications for patient engagement, community resilience, and health outcomes.

15. 【2603.24750】Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off

Link: https://arxiv.org/abs/2603.24750

Authors: Pronob Kumar Barman,Tera L. Reynolds,James Foulds

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Online Health Communities, Health Communities connect, Communities connect patients, Online Health, Health Communities

Comments:

Click to view abstract

Abstract:Online Health Communities connect patients for peer support, but users face a discovery challenge when they have minimal prior interactions to guide personalization. We study recommendation under extreme interaction sparsity in a survey driven setting where each user provides a 16 dimensional intake vector and each support group has a structured feature profile. We extend Neural Collaborative Filtering architectures, including Matrix Factorization, Multi Layer Perceptron, and NeuMF, with an auxiliary pseudo label objective derived from survey group feature alignment using cosine similarity mapped to [0, 1]. The resulting Pseudo Label NCF learns dual embedding spaces: main embeddings for ranking and pseudo label embeddings for semantic alignment. We evaluate on a dataset of 165 users and 498 support groups using a leave one out protocol that reflects cold start conditions. All pseudo label variants improve ranking performance: MLP improves HR@5 from 2.65% to 5.30%, NeuMF from 4.46% to 5.18%, and MF from 4.58% to 5.42%. Pseudo label embedding spaces also show higher cosine silhouette scores than baseline embeddings, with MF improving from 0.0394 to 0.0684 and NeuMF from 0.0263 to 0.0653. We further observe a negative correlation between embedding separability and ranking accuracy, indicating a trade off between interpretability and performance. These results show that survey derived pseudo labels improve recommendation under extreme sparsity while producing interpretable task specific embedding spaces.
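
Two quantities mentioned in this abstract are easy to make concrete: a pseudo label obtained from cosine similarity mapped to [0, 1], and the HR@K metric. The sketch below is a guess at the simplest version of each, not the paper's code. In particular, the affine mapping `(cos + 1) / 2` is an assumption (the abstract does not state which mapping is used), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_label(user_survey, group_features):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed affine mapping)."""
    return (cosine(user_survey, group_features) + 1.0) / 2.0


def hit_rate_at_k(ranked_lists, held_out_items, k=5):
    """Fraction of users whose held-out item appears in their top-k ranked list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, held_out_items)
               if target in ranked[:k])
    return hits / len(held_out_items)
```

Under a leave-one-out protocol, each user contributes exactly one held-out item, so HR@5 is the share of users for whom that item is ranked among the first five candidates.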

Computer Vision

1. 【2603.25746】ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Link: https://arxiv.org/abs/2603.25746

Authors: Yawen Luo,Xiaoyu Shi,Junhao Zhuang,Yutian Chen,Quande Liu,Xintao Wang,Pengfei Wan,Tianfan Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long narrative storytelling, bidirectional architectures suffer, causal multi-shot architecture, crucial for long, suffer from limited

Comments: Project Page: [this https URL](https://luo0207.github.io/ShotStream/) Code: [this https URL](https://github.com/KlingAIResearch/ShotStream)

Click to view abstract

Abstract:Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. And a RoPE discontinuity indicator is employed to explicitly distinguish the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our

2. 【2603.25745】Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Link: https://arxiv.org/abs/2603.25745

Authors: Yixing Lao,Xuyang Bai,Xiaoyang Wu,Nuoyuan Yan,Zixin Luo,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Shiwei Li,Hengshuang Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Splatting methods predict, Gaussian Splatting methods, Gaussian Splatting, Existing feed-forward, methods predict pixel-aligned

Comments:

Click to view abstract

Abstract:Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: this https URL

3. 【2603.25744】MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Link: https://arxiv.org/abs/2603.25744

Authors: Bocheng Zou,Mu Cai,Mark Stanley,Dingfu Lu,Yong Jae Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, offering robust representations, Foundation Models, Vision Foundation, offering robust

Comments:

Click to view abstract

Abstract:Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.

4. 【2603.25743】RefAlign: Representation Alignment for Reference-to-Video Generation

Link: https://arxiv.org/abs/2603.25743

Authors: Lei Wang,YuXin Song,Ge Wu,Haocheng Feng,Hang Zhou,Jingdong Wang,Yaxing Wang,Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: controllable video synthesis, video synthesis paradigm, generation process, enabling applications, virtual try-on

Comments: 17 pages, 11 figures

Click to view abstract

Abstract:Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy--paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.

5. 【2603.25741】Vega: Learning to Drive with Natural Language Instructions

Link: https://arxiv.org/abs/2603.25741

Authors: Sicheng Zuo,Yuxuan Li,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: reshaped autonomous driving, reshaped autonomous, incorporate languages, autonomous driving, driving

Comments: Code is available at [this https URL](https://github.com/zuosc19/Vega)

Click to view abstract

Abstract:Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

6. 【2603.25740】Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Link: https://arxiv.org/abs/2603.25740

Authors: Zehao Wang,Huaide Jiang,Shuaiwu Dong,Yuping Wang,Hang Qiu,Jiachen Li

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Human driving behavior, inherently personal, Human driving, Human, driving

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: [this https URL](https://dmw-cvpr.github.io/)

Click to view abstract

Abstract:Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at this https URL.

7. 【2603.25739】MegaFlow: Zero-Shot Large Displacement Optical Flow

Link: https://arxiv.org/abs/2603.25739

Authors: Dingxi Zhang,Fangjinhua Wang,Marc Pollefeys,Haofei Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large displacement optical, critical challenge, Accurate estimation, large displacement, remains a critical

Comments: Project Page: [this https URL](https://kristen-z.github.io/projects/megaflow) Code: [this https URL](https://github.com/cvg/megaflow)

Click to view abstract

Abstract:Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search or/and domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: this https URL.

8. 【2603.25738】PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Link: https://arxiv.org/abs/2603.25738

Authors: Xincheng Shuai,Song Tang,Yutong Huang,Henghui Ding,Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphic design, e-commerce and advertising, design, innovative process, process that plays

Comments: CVPR 2026, Project Page: [this https URL](https://henghuiding.com/PSDesigner/)

Click to view abstract

Abstract:Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

9. 【2603.25736】How good was my shot? Quantifying Player Skill Level in Table Tennis

Link: https://arxiv.org/abs/2603.25736

Authors: Akihiro Kubota,Tomoya Hasegawa,Ryo Kawahara,Ko Nishino

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently shapes, skill, Gauging, Gauging an individual, player

Comments:

Click to view abstract

Abstract:Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports -- specifically table tennis -- where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context -- including player positioning and opponent behaviors -- the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.

10. 【2603.25734】Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Link: https://arxiv.org/abs/2603.25734

Authors: Ziyin Wang,Sirui Xu,Chuan Guo,Bing Zhou,Jiangshan Gong,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: animations remains challenging, Generating realistic human-object, requires jointly modeling, jointly modeling dynamic, modeling dynamic human

Comments: Project Page: [this http URL](http://ziyinwang1.github.io/LIGHT)

Click to view abstract

Abstract:Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.

11. 【2603.25733】SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Link: https://arxiv.org/abs/2603.25733

Authors: Jiwook Han,Geo Ahn,Youngrae Kim,Jinwoo Choi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Video Temporal Grounding

Comments: Accepted to GRAIL-V workshop at CVPR 2026

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

12. 【2603.25732】BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Link: https://arxiv.org/abs/2603.25732

Authors: Yan Li,Zezi Zeng,Ziwei Zhou,Xin Gao,Muzhao Tian,Yifan Yang,Mingxi Cheng,Qi Dai,Yuqing Yang,Lili Qiu,Zhendong Wang,Zhengyuan Yang,Xue Yang,Lijuan Wang,Ji Li,Chong Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, practical visual content, visual content creation, visual content, expanded their applications

Comments:

Click to view abstract

Abstract:Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.

13. 【2603.25730】PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Link: https://arxiv.org/abs/2603.25730

Authors: Xiaofeng Mao,Shaohao Rui,Kaining Ying,Bo Zheng,Chuanhao Li,Mingmin Chi,Kaipeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Autoregressive video diffusion, linear KV-cache growth, intractable linear KV-cache, demonstrated remarkable progress, video diffusion models

Comments:

Click to view abstract

Abstract:Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis.

14. 【2603.25728】PixelSmile: Toward Fine-Grained Facial Expression Editing

Link: https://arxiv.org/abs/2603.25728

Authors: Jiabin Hua (1 and 2), Hengyuan Xu (1 and 2), Aojie Li (2), Wei Cheng (2), Gang Yu (2), Xingjun Ma (1), Yu-Gang Jiang (1) ((1) Fudan University, (2) StepFun)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: intrinsic semantic overlap, Flex Facial Expression, facial expression editing, Fine-grained facial expression, facial expression

Comments: 21 Pages; Project Page: [this https URL](https://ammmob.github.io/PixelSmile/); Code: [this https URL](https://github.com/Ammmob/PixelSmile)

Click to view abstract

Abstract:Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

15. 【2603.25726】AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Link: https://arxiv.org/abs/2603.25726

Authors: Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hand pose estimation, large-scale synthetic dataset, synthetic dataset designed, hand pose, pose estimation

Comments:

Click to view abstract

Abstract:We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

16. 【2603.25722】No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Link: https://arxiv.org/abs/2603.25722

Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remain a popular, popular choice, models remain, Contrastive vision-language, obtain

Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Contrastive vision-language (VL) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of VL models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic VL capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of VLs: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at this https URL.

17. 【2603.25720】R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Link: https://arxiv.org/abs/2603.25720

Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robust perception, Robust, sensory modalities, reasoning require consistency, require consistency

Comments:

Click to view abstract

Abstract:Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.

18. 【2603.25716】Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Link: https://arxiv.org/abs/2603.25716

Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shown immense potential, primarily treat environments, mechanisms primarily treat, Video world models, physical world

Comments:

Click to view abstract

Abstract:Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

19. 【2603.25711】Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Link: https://arxiv.org/abs/2603.25711

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Diffusion Large, Large Language Models, Diffusion Large Language, achieve high-concurrency generation, Diffusion Large

Comments:

Click to view abstract

Abstract:Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
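
To make the entropy-based re-ranking idea concrete, here is a minimal sketch that scores each candidate token by its text log-probability minus a penalty on the spatial entropy of its attention maps. The function names, the mean-over-heads aggregation, and the weight `lam` are assumptions for illustration; the abstract does not give VISAGE's exact formula.

```python
import math
from typing import Dict, List, Sequence, Tuple


def spatial_entropy(attn: Sequence[float]) -> float:
    """Normalised Shannon entropy of one attention map over image patches.

    Close to 1.0 for a uniform map and 0.0 for a one-hot map.
    Expects at least two patches and a positive total weight.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(attn))


def rerank(candidates: Dict[str, Tuple[float, List[Sequence[float]]]], lam: float = 1.0) -> List[str]:
    """Order candidate tokens by log-probability minus an entropy penalty.

    candidates maps token -> (text log-probability, per-head attention maps).
    The penalty is the mean entropy across heads (an assumed aggregation).
    """
    def score(item):
        logprob, heads = item[1]
        mean_h = sum(spatial_entropy(h) for h in heads) / len(heads)
        return logprob - lam * mean_h

    return [tok for tok, _ in sorted(candidates.items(), key=score, reverse=True)]
```

With `lam=0` the ordering falls back to plain language likelihood, which is the language-only ranking the abstract identifies as the source of the mismatch.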

20. 【2603.25707】TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Link: https://arxiv.org/abs/2603.25707

Authors: Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: original scene content, target object trajectory, alter a target, original scene, study object motion

Comments: webpage: [this https URL](https://trace-motion.github.io/)

Click to view abstract

Abstract:We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

21. 【2603.25706】Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Link: https://arxiv.org/abs/2603.25706

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent unified models, made unprecedented progress, Recent unified, made unprecedented, unprecedented progress

Comments: CVPR 2026 Camera-ready, Webpage: [this https URL](https://doubiiu.github.io/projects/WanWeaver)

Click to view abstract

Abstract:Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

22. 【2603.25689】LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation

Link: https://arxiv.org/abs/2603.25689

Authors: Ishaan Gakhar, Laven Srivastava, Sankarshanaa Sagaram, Aditya Kasliwal, Ujjwal Verma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation events, coastal Earth Observation, unmanned surface vessels, Earth Observation, Observation events

Comments: Accepted at the MaCVi Workshop, CVPR 2026

Click to view abstract

Abstract:Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.
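
For readers unfamiliar with Laplacian pyramids, the sketch below shows the generic decomposition in plain Python: each level stores the detail lost by downsampling, which is where edges show up. It is not LEMMA's architecture. The 2x2 box filter and nearest-neighbour upsampling are simplifying assumptions standing in for the usual Gaussian filtering, and image sides are assumed to be divisible by 2 at every level.

```python
def downsample(img):
    """2x2 average pooling (image sides must be even)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)] for y in range(0, h, 2)]


def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    out = []
    for row in img:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplacian_pyramid(img, levels):
    """Return (detail bands from fine to coarse, coarsest base image)."""
    bands, current = [], img
    for _ in range(levels):
        small = downsample(current)
        up = upsample(small)
        # Detail band: what the coarser level cannot represent (mostly edges).
        bands.append([[c - u for c, u in zip(cr, ur)] for cr, ur in zip(current, up)])
        current = small
    return bands, current


def reconstruct(bands, base):
    """Invert laplacian_pyramid by adding the bands back, coarse to fine."""
    current = base
    for band in reversed(bands):
        up = upsample(current)
        current = [[b + u for b, u in zip(br, ur)] for br, ur in zip(band, up)]
    return current
```

On an image with a single vertical edge, the finest band is non-zero only around that edge, and adding the bands back to the base recovers the input.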

23. 【2603.25686】Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Link: https://arxiv.org/abs/2603.25686

Authors: Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enabling GPS-denied localization, estimates a camera, enabling GPS-denied, localization and navigation, geo-referenced overhead imagery

Comments: 18 pages, 6 figures

Click to view abstract

Abstract:Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
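
The coarse-to-fine formulation can be pictured as a short loop of zoom decisions over a map tile. The sketch below assumes a four-way quadrant split per step and uses an `oracle` policy that already knows the target; both are hypothetical stand-ins, since the abstract does not specify the action space and the real decisions come from a learned model.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def zoom(box: Box, quadrant: int) -> Box:
    """Return one quadrant: 0=lower-left, 1=lower-right, 2=upper-left, 3=upper-right."""
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    left = quadrant % 2 == 0
    lower = quadrant < 2
    return (x0 if left else xm, y0 if lower else ym,
            xm if left else x1, ym if lower else y1)


def localize(box: Box, policy: Callable[[Box], int], steps: int) -> Box:
    """Apply `steps` zoom decisions in sequence and return the terminal cell."""
    for _ in range(steps):
        box = zoom(box, policy(box))
    return box


def oracle(target: Tuple[float, float]) -> Callable[[Box], int]:
    """Stand-in for the learned model: always picks the quadrant holding target."""
    tx, ty = target

    def policy(box: Box) -> int:
        x0, y0, x1, y1 = box
        right = tx >= (x0 + x1) / 2.0
        upper = ty >= (y0 + y1) / 2.0
        return int(right) + 2 * int(upper)

    return policy
```

Each step halves the side of the cell, so under this assumed split three decisions narrow a 100-unit map down to a 12.5-unit cell.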

24. 【2603.25685】Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Link: https://arxiv.org/abs/2603.25685

Authors: Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Action-conditioned robot world, robot action sequence, traditional physics engines, Action-conditioned robot, future video frames

Comments: 34 pages, 11 figures, 12 tables

Click to view abstract

Abstract:Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

25. 【2603.25672】Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving

Link: https://arxiv.org/abs/2603.25672

Authors: Yuqian Shao, Xiaosong Jia, Langechuan Liu, Junchi Yan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, remarkable progress, autonomous driving, driving, autonomous driving metrics

Comments: Project page: [this https URL](https://thinklab-sjtu.github.io/Bench2Drive-Speed/)

Click to view abstract

Abstract:End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has been long overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users' desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at this https URL
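
The re-annotation strategy mentioned in this abstract, treating the speed observed in future frames as the training target, can be sketched in a few lines. The `horizon` parameter, the handling of frames without a future observation, and the error function are assumptions for illustration; in particular, the error below is not the paper's Speed-Adherence Score.

```python
from typing import List, Optional, Sequence


def relabel_target_speed(speeds: Sequence[float], horizon: int) -> List[Optional[float]]:
    """For each frame t, use the speed observed at t + horizon as the target.

    Frames without a future observation get None and would be dropped
    from training under this assumed scheme.
    """
    n = len(speeds)
    return [speeds[t + horizon] if t + horizon < n else None for t in range(n)]


def mean_abs_speed_error(actual: Sequence[float], target: Sequence[Optional[float]]) -> float:
    """Mean |actual - target| over frames that have a target (illustrative only)."""
    pairs = [(a, g) for a, g in zip(actual, target) if g is not None]
    return sum(abs(a - g) for a, g in pairs) / len(pairs)
```

For a log with speeds `[5, 6, 7, 8, 9]` and `horizon=2`, the first three frames receive targets 7, 8, and 9, and the last two frames have none.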

26. 【2603.25661】Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Link: https://arxiv.org/abs/2603.25661

Authors: Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reduce adaptation costs, effectively improve performance, pretrained VLA models, standard supervised finetuning, VLA models

Comments:

Click to view abstract

Abstract:This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: this https URL
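
The parameter-space step described in this abstract (take the difference between two trained models, then merge it into the pretrained weights) can be sketched as follows. The subtraction order (auxiliary-trained minus standard-SFT), the scaling factor `alpha`, and the flat dict-of-lists parameter layout are assumptions; the abstract only states that the difference between the two resulting parameter sets is merged with the pretrained parameters.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Element-wise difference between two models trained on the same small task set."""
    return {k: [a - s for a, s in zip(theta_aux[k], theta_sft[k])] for k in theta_aux}


def merge(theta_pre: Params, vector: Params, alpha: float = 1.0) -> Params:
    """Add the scaled capability vector to the pretrained parameters."""
    return {k: [p + alpha * v for p, v in zip(theta_pre[k], vector[k])] for k in theta_pre}
```

The merged parameters would then serve as the starting point for ordinary SFT on downstream tasks, which is where the abstract places the orthogonal regularization loss.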

27. 【2603.25636】Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

Link: https://arxiv.org/abs/2603.25636

Authors: Chengshuai Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prototyping imaging instruments, computational imaging system

Comments: 28 pages, 7 figures, 8 tables, includes Supplementary Information (sections S1-S6)

Click to view abstract

Abstract:Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce this http URL, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.

28. 【2603.25629】LanteRn: Latent Visual Structured Reasoning

Link: https://arxiv.org/abs/2603.25629

Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: current large multimodal, reasoning remains challenging, remains challenging, challenging for current, current large

Comments:

Click to view abstract

Abstract:While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

29. 【2603.25613】Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Link: https://arxiv.org/abs/2603.25613

Authors: Ünsal Öztürk, Hatef Otroshi Shahreza, Sébastien Marcel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Comments: Accepted in CVPR 2026 workshops

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
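
To illustrate the closing point of this abstract, the sketch below computes a false match rate (FMR) per demographic group at a fixed threshold and a simple max-minus-min gap. The record layout and the gap measure are assumptions for illustration; the abstract does not name its four FMR-based fairness metrics, so this gap is not necessarily one of them.

```python
from typing import Dict, List, Tuple

# Each record: (demographic group, similarity score, same_person flag)
Record = Tuple[str, float, bool]


def fmr_per_group(records: List[Record], threshold: float) -> Dict[str, float]:
    """False match rate per group: share of impostor pairs accepted at threshold."""
    accepted: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for group, score, same in records:
        if same:
            continue  # only different-person (impostor) pairs count toward FMR
        total[group] = total.get(group, 0) + 1
        accepted[group] = accepted.get(group, 0) + int(score >= threshold)
    return {g: accepted[g] / total[g] for g in total}


def fmr_gap(fmr: Dict[str, float]) -> float:
    """Largest minus smallest group FMR (0.0 means identical rates)."""
    return max(fmr.values()) - min(fmr.values())
```

A model that accepts every impostor pair in every group has a gap of 0.0 despite an FMR of 1.0 everywhere, which is how a uniformly inaccurate model can look fair under a disparity measure alone.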

30. 【2603.25607】DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

Link: https://arxiv.org/abs/2603.25607

Authors: Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: detected lung nodules, clinical trial, widespread adoption, notably increased, increased the number

Comments: 28 pages for main text and 37 pages for supplementary information, 7 figures in main text and 9 figures in supplementary information

Abstract:The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved a diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall κ: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
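The inter-reader consistency figures above (0.313 vs. 0.421, described as fair and moderate) read like a kappa statistic. The abstract does not say which variant was used; the sketch below assumes Fleiss' kappa, a common choice for more than two readers, and uses the conventional Landis and Koch bands for the verbal labels. The function names are hypothetical.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] is the number of readers who put subject i in category j.
    Every row must sum to the same number of readers (at least 2).
    """
    n_subjects = len(table)
    n_readers = sum(table[0])
    n_categories = len(table[0])
    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in table) / (n_subjects * n_readers)
        for j in range(n_categories)
    ]
    # Per-subject observed agreement among reader pairs.
    p_subj = [
        (sum(c * c for c in row) - n_readers) / (n_readers * (n_readers - 1))
        for row in table
    ]
    p_obs = sum(p_subj) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1.0 - p_exp)


def landis_koch_label(kappa):
    """Verbal band for a kappa value (Landis and Koch convention)."""
    if kappa <= 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these bands, 0.313 falls in "fair" and 0.421 in "moderate", which matches the wording in the abstract.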

31. 【2603.25580】UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation

Link: https://arxiv.org/abs/2603.25580

Authors: Chengfeng Zhao, Junbo Qi, Yulou Liu, Zhiyang Dou, Minchen Li, Taku Komura, Ziwei Liu, Wenping Wang, Yuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Simulating physically realistic, virtual immersive experience, Simulating physically, physically realistic garment, physics simulation methods

Comments: Project page: [this https URL](https://igl-hkust.github.io/UNIC/)

Abstract:Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of complex garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.

32. 【2603.25573】Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Link: https://arxiv.org/abs/2603.25573

Authors: Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate biodiversity identification, large-scale field data, impact on ecology, environmental monitoring, field data

Comments: Accepted at the ICLR 2026 Workshop on Foundation Models for Science (FM4Science)

Abstract:Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.

33. 【2603.25565】GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Link: https://arxiv.org/abs/2603.25565

Authors: Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu, Wufan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current Large Multimodal, Earth Observation typically, Large Multimodal Models, Observation typically neglect, Current Large

Comments: 18 pages, 4 figures

Abstract:Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

34. 【2603.25555】Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

Link: https://arxiv.org/abs/2603.25555

Authors: Nikolo Rohrmoser, Ghazal Ghazaei, Michael Sommersperger, Nassir Navab

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: operating rooms paves, rooms paves, Purpose, OPMI, multimodal imaging

Comments:

Abstract:Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing rates of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 $\mu m$ (OPMI only) to 33 $\mu m$ (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.

35. 【2603.25539】PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos

Link: https://arxiv.org/abs/2603.25539

Authors: Yihao Wang, Yang Miao, Wenshuai Zhao, Wenyan Yang, Zihan Wang, Joni Pajarinen, Luc Van Gool, Danda Pani Paudel, Juho Kannala, Xi Wang, Arno Solin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Articulation perception aims, drawers and cupboards, scene understanding, understanding in robotics, perception aims

Comments: 32 pages, 13 figures. Project page: [this https URL](https://aaltoml.github.io/PAWS/)

Abstract:Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on the public data sets, including HD-EPIC and Arti4D data sets, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at this https URL.

36. 【2603.25535】Insights on back marking for the automated identification of animals

Link: https://arxiv.org/abs/2603.25535

Authors: David Brunner, Marie Bordes, Elisabeth Mayrhuber, Stephan M. Winkler, Viktoria Dorfer, Maciej Oczak

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: back mark design, back marks, uniform looking species, back, back mark

Comments:

Abstract:To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model's predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.

37. 【2603.25533】BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Link: https://arxiv.org/abs/2603.25533

Authors: Ning Ding, Keisuke Fujii, Toru Tamaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding tactical dynamics, requires analyzing entire, Understanding tactical, badminton requires analyzing, badminton requires

Comments: CVSports2026 accepted

Abstract:Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

38. 【2603.25527】Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Link: https://arxiv.org/abs/2603.25527

Authors: Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive results, Recent advances, data, visual quality, impressive results

Comments: Accepted to CVPR 2026

Abstract:Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
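The sampling idea described in this abstract can be sketched as follows: motion-rich samples draw training timesteps skewed toward the high end, high-visual-quality samples toward the low end. The power-law weighting, the `skew` parameter, the 1000-step schedule, and the `"motion"`/`"visual"` labels are all assumptions made for illustration; the abstract does not specify the actual distribution used by TQD.

```python
import random

NUM_TIMESTEPS = 1000  # assumed diffusion schedule length


def timestep_weights(kind, num_timesteps=NUM_TIMESTEPS, skew=2.0):
    """Unnormalised sampling weights over timesteps 0..num_timesteps-1.

    kind == "motion": favour high timesteps.
    kind == "visual": favour low timesteps.
    anything else:    uniform, as in conventional training.
    """
    weights = []
    for t in range(num_timesteps):
        u = (t + 0.5) / num_timesteps  # position in (0, 1)
        if kind == "motion":
            weights.append(u ** skew)
        elif kind == "visual":
            weights.append((1.0 - u) ** skew)
        else:
            weights.append(1.0)
    return weights


def sample_timesteps(kind, count, rng, num_timesteps=NUM_TIMESTEPS, skew=2.0):
    """Draw `count` timesteps for a sample of the given quality kind."""
    weights = timestep_weights(kind, num_timesteps, skew)
    return rng.choices(range(num_timesteps), weights=weights, k=count)
```

In a training loop, the data loader would tag each clip with its quality kind and call `sample_timesteps` in place of a uniform draw; setting `skew=0.0` recovers uniform sampling for every kind.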

39. 【2603.25524】CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

Link: https://arxiv.org/abs/2603.25524

Authors: Alex Hoi Hang Chan, Neha Singhal, Onur Kocahan, Andrea Meltzer, Saverio Lubrano, Miyako H. Warrington, Michel Griesser, Fumihiro Kano, Hemal Naik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Long-term behavioral monitoring, studying behavioral, time scales, evolutionary biology, crucial for studying

Comments: 8 pages, 4 figures

Abstract:Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
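Matching a detected ring combination against a database, as described for CORVID, might look like the sketch below. It assumes a classifier has already produced a colour probability distribution for each ring position and treats the positions as independent; the function name, the data layout, and the independence assumption are hypothetical and not taken from the paper.

```python
def match_ring_combo(ring_probs, database):
    """Score each known bird by how well its ring colours fit the detections.

    ring_probs: list of dicts, one per ring position, mapping colour -> probability.
    database:   dict mapping bird id -> tuple of ring colours in the same order.
    Returns a dict of bird id -> normalised probability (empty if nothing fits).
    """
    scores = {}
    for bird_id, combo in database.items():
        if len(combo) != len(ring_probs):
            continue  # combination length does not match the detection
        score = 1.0
        for probs, colour in zip(ring_probs, combo):
            score *= probs.get(colour, 0.0)
        scores[bird_id] = score
    total = sum(scores.values())
    if total == 0.0:
        return {}
    return {bird_id: s / total for bird_id, s in scores.items()}
```

Restricting candidates to combinations that exist in the database is what lets an uncertain per-ring classification still resolve to a single individual.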

40. 【2603.25510】Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

Link: https://arxiv.org/abs/2603.25510

Authors: Koldo Basterretxea, Jon Gutiérrez-Zaballa, Javier Echanobe

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: hyperspectral imaging, autonomous driving, faces many challenges, challenges related, application domain

Comments:

Abstract:The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, non-controlled and variable lighting conditions, the wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.

41. 【2603.25502】RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

Link: https://arxiv.org/abs/2603.25502

Authors: Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, jinghong lan, Shiyu Liu, Yuqi Peng, Gang YU, Shifeng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Nano Banana Pro, object detection, critical for downstream, autonomous driving, driving and object

Comments: 27 pages, 15 figures, Project homepage: [this https URL](https://yfyang007.github.io/RealRestorer/)

Abstract:Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.

42. 【2603.25499】Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects

Link: https://arxiv.org/abs/2603.25499

Authors: Jakob Paul Zimmermann, Gerrit Holzbach, David Lerch

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: missing pedestrians, Guided Failure Prediction, fail silently, emitting any warning, Knowledge Guided Failure

Comments:

Abstract:Object detectors deployed in safety-critical environments can fail silently, e.g., missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at this https URL.
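The angular-distance gate described in this abstract can be sketched in a few lines. The sketch assumes the two encoders have already projected the detector features and the foundation-model embedding into a shared space, and that a set of angles from images where the detector succeeded is available for calibration. The function names and the calibration procedure are hypothetical, not the published implementation.

```python
import math


def angular_distance(a, b):
    """Angle in radians between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def threshold_at_fpr(ok_angles, target_fpr=0.05):
    """Pick an angle threshold that flags at most target_fpr of good images.

    ok_angles: angles measured on images where the detector did not fail.
    """
    ranked = sorted(ok_angles)
    allowed = int(target_fpr * len(ranked))  # good images we may flag
    return ranked[len(ranked) - allowed - 1]


def is_flagged(angle, threshold):
    """An image is rejected when its embeddings diverge beyond the threshold."""
    return angle > threshold
```

Used as a selective-prediction gate, only images with `is_flagged(...) == False` are passed on, and recall is then measured among those accepted images.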

43. 【2603.25494】AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

Link: https://arxiv.org/abs/2603.25494

Authors: Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic scene completion, monocular semantic scene, outdoor counterpart due, Indoor monocular semantic, indoor MSSC

Comments: Accepted at CVPR 2026

Abstract:Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: this https URL.

44. 【2603.25467】GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

Link: https://arxiv.org/abs/2603.25467

Authors: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hallucinated false alarms, Vision-Language Models, powerful open-set reasoners, calibrated anomaly priors, surveillance is fragile

Comments:

Abstract:Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation this http URL and qualitative video results are available at this https URL.
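The Self-Consistency Consolidation step, which keeps only proposals that recur across independent samplings, can be sketched as a voting filter. Matching proposals by normalised text and the `min_votes` parameter are assumptions made for illustration; the abstract does not say how GridVAD decides that two proposals are the same.

```python
from collections import Counter


def normalize(proposal):
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(proposal.lower().split())


def consolidate(samplings, min_votes):
    """Keep proposals that appear in at least `min_votes` samplings.

    samplings: list of lists, one list of proposal strings per VLM sampling.
    Returns the surviving normalised proposals in sorted order.
    """
    votes = Counter()
    for proposals in samplings:
        # A proposal counts at most once per sampling.
        votes.update({normalize(p) for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)
```

Raising `min_votes` discards more one-off proposals, which is one way to picture the precision-recall tradeoff the ablations describe: fewer hallucinated alarms at the cost of dropping some true anomalies that were proposed only once.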

45. 【2603.25463】CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

Link: https://arxiv.org/abs/2603.25463

Authors: Keming Ye, Zhou Zhao, Fan Wu, Shengyu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving performance comparable, recently made notable, made notable progress, models have recently, achieving performance

Comments: 23 pages, 10 tables, 7 figures

Abstract:Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and inherent spatial redundancy which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.

46. 【2603.25442】DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming

Link: https://arxiv.org/abs/2603.25442

Authors: Wei Lian, Fei Ma, Hang Pan, Zhesen Cui, Wangmeng Zuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large misalignments remains, Achieving globally optimal, optimal point cloud, point cloud registration, globally optimal point

Comments:

Abstract:Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation ($\boldsymbol{\theta}$) and correspondence ($\mathbf{P}$) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between $\boldsymbol{\theta}$ and $\mathbf{P}$. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.

47. 【2603.25420】VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Link: https://arxiv.org/abs/2603.25420

Authors: George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: additional data collection, enabled realistic resimulation, Recent progress, pretrained robot policies, data collection

Comments:

Click to view abstract

Abstract:Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.

48. 【2603.25411】HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Link: https://arxiv.org/abs/2603.25411

Authors: Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng, Tong Zhang, Yaobo Liang, Jiaolong Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Achieving human-like spatial, Achieving human-like, recognizing object properties, performing high-level spatial, requires inferring

Comments: Accepted by CVPR 2026. Project page: [this https URL](https://microsoft.github.io/HiSpatial)

Click to view abstract

Abstract:Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

49. 【2603.25399】LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Link: https://arxiv.org/abs/2603.25399

Authors: Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang, Jialin Tian, Xinyu Zhan, Lifeng Zhu, Cewu Lu, Lixin Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: framework that embeds, embeds dense, robotic manipulation, Motion Expert, Action Expert

Comments:

Click to view abstract

Abstract:We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at this https URL.

50. 【2603.25398】PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Link: https://arxiv.org/abs/2603.25398

Authors: Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Foundation, downstream tasks simultaneously, serve multiple downstream, multiple downstream tasks

Comments: 8 pages, ECV 2026, CVPR Workshop

Click to view abstract

Abstract:Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: this https URL.

51. 【2603.25389】FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection

Link: https://arxiv.org/abs/2603.25389

Authors: Yingmei Zhang, Wangtao Bao, Yong Yang, Weiguo Wan, Qin Xiao, Xueting Zou

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared small target, Infrared small, aims to identify, identify and distinguish, distinguish small targets

Comments:

Click to view abstract

Abstract:Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network's sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages Fast Fourier transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The codes will be released on this https URL.

52. 【2603.25388】Multimodal Dataset Distillation via Phased Teacher Models

Link: https://arxiv.org/abs/2603.25388

Authors: Shengbin Guo, Hang Zhao, Senqiao Yang, Chenyang Jiang, Yuhang Cheng, Xiangru Peng, Rui Shao, Zhuotao Tian

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal dataset distillation, compact synthetic datasets, construct compact synthetic, enable efficient compression, Multimodal dataset

Comments: Accepted to ICLR 2026

Click to view abstract

Abstract:Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: this https URL.

53. 【2603.25383】CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

Link: https://arxiv.org/abs/2603.25383

Authors: Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyung Sim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong zero-shot generalization, demonstrates strong zero-shot, CLIP aligns image, zero-shot generalization, aligns image

Comments:

Click to view abstract

Abstract:CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.

54. 【2603.25366】Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics

Link: https://arxiv.org/abs/2603.25366

Authors: João Castelo-Branco, José Santos-Victor, Alexandre Bernardino

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mobile robots operating, Autonomous object search, Autonomous object, deep reinforcement learning, challenging for mobile

Comments: Accepted for publication at ICARSC 2026, the 26th IEEE International Conference on Autonomous Robot Systems and Competitions

Click to view abstract

Abstract:Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable object-search behavior under partial observability.
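
The belief-map update mentioned in this abstract can be illustrated with a plain grid-based Bayes rule. This is only a minimal sketch under stated assumptions: the 2x2 grid, the detector rates `p_hit` and `p_false`, and the function name `update_belief` are hypothetical and are not taken from the paper.

```python
def update_belief(belief, observed_cell, detected, p_hit=0.8, p_false=0.1):
    """Bayesian update of a belief map over target locations.

    belief        : dict mapping cell -> prior probability (sums to 1)
    observed_cell : the cell the robot is currently observing
    detected      : True if the detector fired on this observation
    p_hit         : assumed P(detection | target in observed cell)
    p_false       : assumed P(detection | target in another cell)
    """
    posterior = {}
    for cell, prior in belief.items():
        p_det = p_hit if cell == observed_cell else p_false
        likelihood = p_det if detected else 1.0 - p_det
        posterior[cell] = likelihood * prior
    total = sum(posterior.values())
    # Normalize so the posterior is again a probability distribution.
    return {cell: p / total for cell, p in posterior.items()}


# Uniform prior over a hypothetical 2x2 grid.
prior = {(r, c): 0.25 for r in range(2) for c in range(2)}
after_miss = update_belief(prior, observed_cell=(0, 0), detected=False)
after_hit = update_belief(prior, observed_cell=(0, 0), detected=True)
```

A missed detection lowers the probability of the observed cell and shifts mass to the rest of the map, while a detection does the opposite. A learned policy, as described in the abstract, would then take such a map as its input.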

55. 【2603.25357】InstanceAnimator: Multi-Instance Sketch Video Colorization

Link: https://arxiv.org/abs/2603.25357

Authors: Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin, Qifeng Chen, Anyi Rao, Zeyu Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformer framework, Transformer framework, Diffusion Transformer, multi-instance sketch video, sketch video colorization

Comments:

Click to view abstract

Abstract:We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

56. 【2603.25351】Image Rotation Angle Estimation: Comparing Circular-Aware Methods

Link: https://arxiv.org/abs/2603.25351

Authors: Maximilian Woehrer

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Automatic image rotation, key preprocessing step, Automatic image, image rotation estimation, circular Gaussian distribution

Comments: 7 pages, 3 figures, 2 tables. Under review at Pattern Recognition Letters

Click to view abstract

Abstract:Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
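
The boundary discontinuity described in this abstract is easy to see in a few lines: 359° and 1° are only 2° apart, yet a naive difference gives 358°. The sketch below shows a wrap-around error metric and the unit-vector encoding that one of the five listed methods regresses. The function names are hypothetical, and this is not the paper's implementation.

```python
import math


def circular_abs_error_deg(pred_deg, target_deg):
    """Smallest absolute angular difference in degrees, in [0, 180]."""
    diff = (pred_deg - target_deg) % 360.0
    return min(diff, 360.0 - diff)


def angle_to_unit_vector(angle_deg):
    """Encode an angle as (cos, sin), which is continuous across 0/360."""
    rad = math.radians(angle_deg)
    return math.cos(rad), math.sin(rad)


def unit_vector_to_angle(x, y):
    """Decode a (cos, sin) pair back to an angle in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


naive_error = abs(359.0 - 1.0)                      # ignores the wrap-around
circular_error = circular_abs_error_deg(359.0, 1.0)  # respects the wrap-around
```

Averaging `circular_abs_error_deg` over a test set gives an MAE that treats angles as points on a circle, which is the sense in which an MAE in degrees is meaningful for this task.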

57. 【2603.25336】HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

Link: https://arxiv.org/abs/2603.25336

Authors: Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, Sungroh Yoon

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Geometry Grounded, Geometry Grounded Transformer, Visual Geometry, Grounded Transformer, Geometry Grounded

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at this https URL.

58. 【2603.25319】MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Link: https://arxiv.org/abs/2603.25319

Authors: Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating images conditioned, current models suffer, severe performance degradation, multiple visual references, Generating images

Comments: Project Page: [this https URL](https://macro400k.github.io/)

Click to view abstract

Abstract:Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

59. 【2603.25316】Adaptive Learned Image Compression with Graph Neural Networks

Link: https://arxiv.org/abs/2603.25316

Authors: Yunuo Chen, Bing He, Zezheng Lyu, Hongwei Hu, Qunshan Gu, Yuan Tian, Guo Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image compression, image compression relies, learned image compression, relies on modeling, image

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model's ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at this https URL.

60. 【2603.25296】Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

Link: https://arxiv.org/abs/2603.25296

Authors: Hongru Han, Tingrui Guo, Liming Zhang, Yan Su, Qiwen Xu, Zhuohua Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-light image enhancement, Controllable Low-light Enhancement, deterministic mapping, traditionally been formulated, Low-light image

Comments: 10 pages, 8 figures

Click to view abstract

Abstract:Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating "gt-mean" post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the "scanning gap" inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
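
The Space-to-Depth idea in this abstract, folding spatial neighborhoods into channel dimensions, can be sketched generically in pure Python. The block size of 2, the channel ordering, and the function name `space_to_depth` are assumptions for illustration. The abstract does not specify the paper's exact layout.

```python
def space_to_depth(image, block=2):
    """Fold each block x block spatial neighborhood into the channel axis.

    image: nested list of shape [H][W][C], with H and W divisible by block.
    Returns a nested list of shape [H/block][W/block][C * block * block].
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("H and W must be divisible by block")
    out = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            channels = []
            # Concatenate the channels of every pixel in the neighborhood.
            for dr in range(block):
                for dc in range(block):
                    channels.extend(image[r + dr][c + dc])
            row.append(channels)
        out.append(row)
    return out


# A 2x2 single-channel image becomes one pixel with four channels.
tiny = [[[1], [2]], [[3], [4]]]
folded = space_to_depth(tiny)  # [[[1, 2, 3, 4]]]
```

After folding, a sequence model that scans the flattened grid sees each token carrying its local neighborhood, while the sequence is shorter by a factor of `block * block`.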

61. 【2603.25275】V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

Link: https://arxiv.org/abs/2603.25275

Authors: Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern autonomous vehicle, Modern autonomous, blind spots, limited sensing range, vehicle perception systems

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase is available at this https URL.

62. 【2603.25267】EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

Link: https://arxiv.org/abs/2603.25267

Authors: Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant improvements due, large-scale vision-language pre-trained, Text-video retrieval tasks, Fine-Grained Relationship Learning, vision-language pre-trained models

Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at this https URL.
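
To make the contrast with a softmax-based contrastive loss concrete, the sketch below implements a generic pairwise sigmoid loss of the kind popularized by SigLIP, where every text-video pair is scored as an independent binary classification. Whether EagleNet uses exactly this variant is an assumption. The `temperature` and `bias` defaults and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(similarity, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an N x N text-video similarity matrix.

    similarity[i][j] is the similarity of text i and video j. Matched pairs
    lie on the diagonal (label +1); all other pairs are negatives (label -1).
    temperature and bias are assumed hyperparameters.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * similarity[i][j] + bias
            total -= log_sigmoid(label * logit)
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matched pairs score lowest
loss_aligned = sigmoid_pairwise_loss(aligned)
loss_misaligned = sigmoid_pairwise_loss(misaligned)
```

Unlike a softmax loss, no term here normalizes over the whole batch, so each pair contributes independently of the others.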

63. 【2603.25265】ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis

Link: https://arxiv.org/abs/2603.25265

Authors: Moonyeon Jeong, Seunggi Min, Suhyeon Lee, Hongje Seong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unposed images, Gaussian, synthesis from unposed, Gaussian splatting, Gaussian splatting network

Comments: 24 pages, 10 figures

Click to view abstract

Abstract:We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).

64. 【2603.25260】Towards Practical Lossless Neural Compression for LiDAR Point Clouds

Link: https://arxiv.org/abs/2603.25260

Authors: Pengpeng Yu, Haoran Li, Runqing Jiang, Dingquan Li, Jing Wang, Liang Lin, Yulan Guo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: LiDAR point clouds, efficient context modeling, hinders efficient context, high-precision geometric details, geometric details hinders

Comments:

Click to view abstract

Abstract:LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code will be released upon acceptance. Code is available at this https URL.

65. 【2603.25255】Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection

Link: https://arxiv.org/abs/2603.25255

Authors: Md Awsafur Rahman,Chandrakanth Gudavalli,Hardik Prajapati,B. S. Manjunath

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: urban mobility analysis, detection underpins applications, Trajectory anomaly detection, anomaly detection underpins, underpins applications

Comments:

Abstract:Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
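
The day x time-of-day grid described in the abstract can be illustrated with a minimal sketch. The function name, the 24 bins per day, and the three channels used here (mean latitude, mean longitude, sample count) are assumptions for illustration only; the paper's HTI encodes richer spatial, semantic, temporal, and kinematic channels that the abstract does not specify.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def build_trajectory_grid(points, num_days, bins_per_day):
    """Bin (timestamp_s, lat, lon) samples into a day x time-of-day grid.

    Each cell holds [mean_lat, mean_lon, sample_count]; empty cells stay zero.
    The channel choice is a hypothetical stand-in, not the paper's encoding.
    """
    bin_seconds = SECONDS_PER_DAY / bins_per_day
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for timestamp_s, lat, lon in points:
        day = int(timestamp_s // SECONDS_PER_DAY)
        if not 0 <= day < num_days:
            continue  # sample falls outside the observation window
        slot = int((timestamp_s % SECONDS_PER_DAY) // bin_seconds)
        slot = min(slot, bins_per_day - 1)
        cell = sums[(day, slot)]
        cell[0] += lat
        cell[1] += lon
        cell[2] += 1

    grid = [[[0.0, 0.0, 0.0] for _ in range(bins_per_day)]
            for _ in range(num_days)]
    for (day, slot), (lat_sum, lon_sum, count) in sums.items():
        grid[day][slot] = [lat_sum / count, lon_sum / count, float(count)]
    return grid


# Two samples in the first hour of day 0, one sample at 13:00 on day 1.
samples = [(0, 10.0, 20.0), (1800, 12.0, 22.0), (86_400 + 13 * 3600, 1.0, 2.0)]
grid = build_trajectory_grid(samples, num_days=2, bins_per_day=24)
```

Because dense and sparse trajectories both reduce to the same fixed-size grid, a single image model can in principle consume either, which is the unification the abstract argues for.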

66. 【2603.25250】Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

Link: https://arxiv.org/abs/2603.25250

Authors: Yabin Zhang,Maya Varma,Yunhe Gao,Jean-Benoit Delbrouck,Jiaming Liu,Chong Wang,Curtis Langlotz

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: detection aims, deviate from in-distribution, OOD, aims to identify, detecting OOD based

Comments: CVPR 2026 main track, Codes are available at [this https URL](https://github.com/YBZh/OpenOOD-VLM)

Abstract:Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Code is available at YBZh/OpenOOD-VLM (https://github.com/YBZh/OpenOOD-VLM).

67. 【2603.25249】Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Link: https://arxiv.org/abs/2603.25249

Authors: Qingfeng Li,Haoxian Zhang,Xu He,Songlin Tang,Zhixue Fang,Xiaoqiang Liu,Pengfei Wan,Guoqi Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual tokenizers play, tractable generative modeling, bridging high-dimensional images, Visual tokenizers, generative modeling

Comments:

Abstract:Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive--Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.

68. 【2603.25247】FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

Link: https://arxiv.org/abs/2603.25247

Authors: Taejin Jeong,Joohyeok Kim,Jinyeong Kim,Chanyoung Kim,Seong Jae Hwang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: offering crucial insights, Spatial Transcriptomics, spatially-resolved gene expression, spatial gene expression, offering crucial

Comments:

Abstract:Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at this https URL FEAST.

69. 【2603.25244】Efficient Preemptive Robustification with Image Sharpening

Link: https://arxiv.org/abs/2603.25244

Authors: Jiaming Liang,Chi-Man Pun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, neural networks rely, non-robust representations, great success, deep neural

Comments:

Abstract:Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
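
The abstract does not say which sharpening operator is used, so the following is only a sketch of one standard choice, unsharp masking with a 3x3 box blur, on a grayscale image stored as a list of rows. The function names, the blur kernel, and the `amount` parameter are assumptions, not the authors' method.

```python
def box_blur(image):
    """3x3 mean blur with edge replication on a 2D list of pixel values."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp to the border
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            blurred[y][x] = total / 9.0
    return blurred


def unsharp_mask(image, amount=1.0):
    """Sharpen by adding back the detail removed by the blur, clipped to [0, 255]."""
    blurred = box_blur(image)
    return [
        [min(255.0, max(0.0, pixel + amount * (pixel - soft)))
         for pixel, soft in zip(row, soft_row)]
        for row, soft_row in zip(image, blurred)
    ]


# A single bright pixel gets brighter relative to its flat surroundings.
spot = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
sharpened = unsharp_mask(spot)
```

The operation needs no surrogate classifier, no iterative optimization, and no trained generator, which matches the three properties the abstract emphasizes.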

70. 【2603.25230】A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

Link: https://arxiv.org/abs/2603.25230

Authors: Jiaming Liang,Chi-Man Pun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformation-based adversarial attacks, deceiving classification models, Transformation-based adversarial, demonstrate strong transferability, classification models

Comments:

Abstract:Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
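
The misalignment problem the abstract describes can be shown with a toy sketch: when a spatial transform is applied to the input but not to a dense label, the label no longer covers the object. The helper names and the use of a horizontal flip are illustrative assumptions; this is not the paper's Spatial Alignment algorithm.

```python
def hflip(grid):
    """Horizontally flip a 2D list (image or dense label map)."""
    return [list(reversed(row)) for row in grid]


def transform_pair(image, label, spatial_transform):
    """Apply the same spatial transform to the input and its dense label."""
    return spatial_transform(image), spatial_transform(label)


def overlap(image, label):
    """Count pixels where the object (image > 0) coincides with label == 1."""
    return sum(
        1
        for image_row, label_row in zip(image, label)
        for pixel, tag in zip(image_row, label_row)
        if pixel > 0 and tag == 1
    )


image = [[0, 0, 9], [0, 0, 9]]   # object occupies the right column
label = [[0, 0, 1], [0, 0, 1]]   # segmentation mask for the same object

aligned_image, aligned_label = transform_pair(image, label, hflip)
aligned_overlap = overlap(aligned_image, aligned_label)   # label still on object
misaligned_overlap = overlap(hflip(image), label)         # label left behind
```

For a classification label there is nothing to flip, which is why the issue only appears on spatially structured tasks such as segmentation and detection.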

71. 【2603.25229】An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models

Link: https://arxiv.org/abs/2603.25229

Authors: Sazzad Hossain,Saiful Islam,Muhammad Ibrahim,Md. Rasel Ahmed,Md Shuayb,Ahmedul Kabir

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: health concern worldwide, Skin diseases, major public health, public health concern, Skin

Comments: 14 pages

Abstract:Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In densely populated countries like Bangladesh, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. The resulting lack of proper detection and treatment may lead to severe health consequences, including death. Skin diseases commonly change the color, texture, and pattern of the skin, so image processing and computer vision techniques can be used to detect them. In response to this challenge, we develop a publicly available dataset for common skin disease detection with machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct and the rest are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Contact Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models to the dataset and report classification performance. We expect this research to garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.
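
As a quick consistency check on the figures quoted in the abstract, the per-class counts can be summed and compared with the stated total. The variable names are illustrative; the numbers are taken directly from the abstract.

```python
# Per-class image counts as reported in the abstract.
class_counts = {
    "Contact Dermatitis": 302,
    "Eczema": 381,
    "Scabies": 301,
    "Tinea Ringworm": 316,
    "Vitiligo": 312,
}

total_images = sum(class_counts.values())
distinct_images = 250                      # stated number of distinct images
augmented_images = total_images - distinct_images

# Share of the dataset per class, useful for spotting class imbalance.
class_share = {name: count / total_images for name, count in class_counts.items()}
largest_class = max(class_counts, key=class_counts.get)
```

The counts add up to the reported 1612 images, and the classes are roughly balanced, with Eczema slightly over-represented.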

72. 【2603.25228】Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

Link: https://arxiv.org/abs/2603.25228

Authors: Jonas Hein,Lilian Calvet,Matthias Seibold,Siyu Tang,Marc Pollefeys,Philipp Fürnstahl

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer-assisted interventions, pose estimation, Purpose, surgical, pose

Comments: Accepted at IJCARS: IPCAI 2026

Abstract:Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.

73. 【2603.25218】SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment

Link: https://arxiv.org/abs/2603.25218

Authors: Pengyu Chen,Haotian Sa,Yiwei Hu,Yuhan Cheng,Junbo Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Detecting small unmanned, cluttered aerial backgrounds, unmanned aerial vehicles, perspective presents significant, presents significant challenges

Comments:

Abstract:Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.
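
A short calculation shows why a detection head at 4x downsampling helps with tiny targets. The 640-pixel input size and the P3 to P5 strides of 8, 16, and 32 are the conventional YOLO values and are assumed here; the abstract only states the 4x downsampling of the P2 head.

```python
INPUT_SIZE = 640  # assumed square input resolution in pixels
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}  # P3-P5 are assumed defaults


def grid_side(input_size, stride):
    """Side length of the feature map produced at a given stride."""
    return input_size // stride


def cells_spanned(target_px, stride):
    """How many feature-map cells a target of target_px pixels covers per axis."""
    return target_px / stride


# Feature-map size and coverage of an 8-pixel-wide drone at every level.
summary = {
    level: (grid_side(INPUT_SIZE, stride), cells_spanned(8, stride))
    for level, stride in STRIDES.items()
}
```

Under these assumptions an 8-pixel target covers two cells per axis on the P2 map but only a quarter of a cell on P5, so the finer head keeps spatial detail that coarser levels average away.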

74. 【2603.25209】Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Link: https://arxiv.org/abs/2603.25209

Authors: Jiahao Tian,Chenxi Song,Wei Cheng,Chi Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generating long videos, Generating long, video diffusion models, pre-trained video diffusion, short clips

Comments: Accepted to CVPR 2026. Code: [this https URL](https://github.com/Westlake-AGI-Lab/FreeLOC)

Abstract:Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at this https URL.

75. 【2603.25203】Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

Link: https://arxiv.org/abs/2603.25203

Authors: Ruichao Yang,Wei Gao,Xiaobin Zhu,Jing Ma,Hongzhan Lin,Ziyang Luo,Bo-Wen Zhang,Xu-Cheng Yin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: evades traditional detectors, opaque black boxes, Multimodal misinformation poses, Probabilistic Concept Graph, traditional detectors

Comments: Accepted by CVPR 2026

76. 【2603.25202】CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

Link: https://arxiv.org/abs/2603.25202

Authors: Shaojin Bai,Yuting Su,Weizhi Nie

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Cross-site generalizability, non-randomly dictate hospital, Conventional Domain Generalization, non-randomly dictate, fundamentally compromised

Comments: 10 pages, 2 figures

Abstract:Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.

77. 【2603.25199】TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

Link: https://arxiv.org/abs/2603.25199

Authors: Peng Wen,Yuting Wang,Qiurui Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optimize reward-based objectives, win rate proxies, research primarily aims, optimize reward-based, accurately replicating

Comments: Accepted to CVPR 2026

Abstract:Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity within a defined time, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football.
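
The abstract names two metrics, spatial occupancy similarity and movement vector similarity, without defining them. The sketch below shows one plausible reading, using histogram intersection over a coarse pitch grid and mean cosine similarity of per-player displacement vectors. The 105 x 68 m pitch, the 3 x 2 grid, and both formulas are assumptions, not the benchmark's actual definitions.

```python
import math


def occupancy_histogram(positions, pitch=(105.0, 68.0), cells=(3, 2)):
    """Normalized count of player positions per pitch cell (assumed grid)."""
    counts = [0.0] * (cells[0] * cells[1])
    for x, y in positions:
        cx = min(max(int(x / pitch[0] * cells[0]), 0), cells[0] - 1)
        cy = min(max(int(y / pitch[1] * cells[1]), 0), cells[1] - 1)
        counts[cy * cells[0] + cx] += 1.0
    total = sum(counts) or 1.0
    return [count / total for count in counts]


def occupancy_similarity(real_positions, imitated_positions):
    """Histogram intersection in [0, 1]; 1 means identical cell occupancy."""
    real = occupancy_histogram(real_positions)
    imitated = occupancy_histogram(imitated_positions)
    return sum(min(p, q) for p, q in zip(real, imitated))


def movement_similarity(real_vectors, imitated_vectors):
    """Mean cosine similarity of paired displacement vectors, skipping zeros."""
    scores = []
    for (ax, ay), (bx, by) in zip(real_vectors, imitated_vectors):
        norm_a, norm_b = math.hypot(ax, ay), math.hypot(bx, by)
        if norm_a == 0 or norm_b == 0:
            continue  # a stationary player has no direction to compare
        scores.append((ax * bx + ay * by) / (norm_a * norm_b))
    return sum(scores) / len(scores) if scores else 0.0


same_shape = occupancy_similarity([(10, 10), (90, 60)], [(12, 8), (95, 65)])
opposed = movement_similarity([(1, 0), (0, 2)], [(2, 0), (0, -1)])
```

Separating the two scores mirrors the abstract's split between spatial similarity (where players are) and temporal similarity (how they move).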

78. 【2603.25194】CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

Link: https://arxiv.org/abs/2603.25194

Authors: Marvin Seyfarth,Sarah Kaye Müller,Arman Ghanaat,Isabelle Ayx,Fabian Fastenrath,Philipp Wild,Alexander Hertel,Theano Papavassiliu,Salman Ul Hassan Dar,Sandy Engelhardt

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently achieved strong, achieved strong performance, recently achieved, achieved strong, strong performance

Comments:

Abstract:Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at this https URL.

79. 【2603.25188】AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Link: https://arxiv.org/abs/2603.25188

Authors: Jiahao Wang,Hualian Sheng,Sijia Cai,Yuxiao Yang,Weizhan Zhang,Caixia Yan,Bing Deng,Jieping Ye

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers powerful tools, Identity-preserving video generation, generation offers powerful, customize videos featuring, allowing users

Comments:

Abstract:Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.

80. 【2603.25181】VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Link: https://arxiv.org/abs/2603.25181

Authors: Marvin Seyfarth,Salman Ul Hassan Dar,Yannik Frisch,Philipp Wild,Norbert Frey,Florian André,Sandy Engelhardt

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image synthesis, medical image, high-fidelity medical image, image synthesis, volumetric medical image

Comments:

Abstract:Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at this https URL.
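
To give a feel for what global self-attention over 3D tokens costs, the sketch below counts the tokens produced by non-overlapping volumetric patches and the resulting number of attention pairs. The volume shape and patch size are hypothetical; the abstract does not state the values VolDiT uses.

```python
def volumetric_token_count(volume_shape, patch_size):
    """Number of non-overlapping 3D patches (tokens) in a volume."""
    depth, height, width = volume_shape
    patch_d, patch_h, patch_w = patch_size
    if depth % patch_d or height % patch_h or width % patch_w:
        raise ValueError("volume dimensions must be divisible by the patch size")
    return (depth // patch_d) * (height // patch_h) * (width // patch_w)


def attention_pairs(num_tokens):
    """Query-key pairs evaluated by full (global) self-attention."""
    return num_tokens * num_tokens


# Hypothetical 32 x 64 x 64 volume split into 4 x 8 x 8 patches.
tokens = volumetric_token_count((32, 64, 64), (4, 8, 8))
pairs = attention_pairs(tokens)
```

Halving the patch size along every axis multiplies the token count by 8 and the attention pairs by 64, which is the trade-off between spatial detail and compute that any patch-based 3D transformer has to balance.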

81. 【2603.25178】Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Link: https://arxiv.org/abs/2603.25178

Authors: Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: holds significant potential, generation holds significant, cross-linguistic applications, holds significant, significant potential

Comments: 11 pages, 7 figures

82. 【2603.25175】AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Link: https://arxiv.org/abs/2603.25175

Authors: Md Mushfiqur Azam,John Quarles,Kevin Desai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe perspective distortion, remains challenging due, limited body visibility, estimation remains challenging, human pose estimation

Comments:

Abstract:Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: this https URL.

83. 【2603.25170】Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

Link: https://arxiv.org/abs/2603.25170

Authors: Shiji Zhao, Shukun Xiong, Maoxun Yuan, Yao Huang, Ranjie Duan, Qing Guo, Jiansheng Chen, Haibin Duan, Xingxing Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: infrared object detection, exhibits broad applicability, object detection, detection exhibits broad, object detection exhibits

Comments: Accepted for publication in the International Journal of Computer Vision (IJCV)

Click to view abstract

Abstract:In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.

84. 【2603.25168】ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Link: https://arxiv.org/abs/2603.25168

Authors: Xike Zhang, Maoyuan Ye, Juhua Liu, Bo Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Segment Anything Model, scene text detection, unified scene text, achieved promising performance, based on Segment

Comments: 20 pages, 8 figures, 8 tables. Submitted to ECCV 2026

Click to view abstract

Abstract: Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.

85. 【2603.25165】Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Link: https://arxiv.org/abs/2603.25165

Authors: Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, substantially improved, human annotations, SSL, Recent

Comments: The paper was accepted by CVPR 2026

Click to view abstract

Abstract:Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

86. 【2603.25163】SportSkills: Physical Skill Learning from Sports Instructional Videos

Link: https://arxiv.org/abs/2603.25163

Authors: Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: general human activity, Current large-scale video, Current large-scale, physical skill learning, fine-grained activities needed

Comments: Technical report

Click to view abstract

Abstract:Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

87. 【2603.25159】A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

Link: https://arxiv.org/abs/2603.25159

Authors: SuYeon Kim, Wongyu Lee, MyeongAh Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point clouds trained, clouds trained solely, anomaly detection targets, point clouds, normal data

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract: 3D anomaly detection targets the detection and localization of defects in 3D point clouds using models trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE), where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art results for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

88. 【2603.25157】Vision Hopfield Memory Networks

Link: https://arxiv.org/abs/2603.25157

Authors: Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Transformer families, enabling unified modeling, achieved remarkable progress, Recent vision, Hopfield Memory Network

Comments:

Click to view abstract

Abstract:Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.

89. 【2603.25155】Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.25155

Authors: Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, large language models, Multimodal large, imaging is hindered, large language

Comments: Accepted by ICLR 2026

Click to view abstract

Abstract:Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

90. 【2603.25145】Learning to Rank Caption Chains for Video-Text Alignment

Link: https://arxiv.org/abs/2603.25145

Authors: Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Direct preference optimization, Direct preference, train language models, technique to train, train language

Comments:

Click to view abstract

Abstract:Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
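The contrast between binary DPO and ranking optimization can be sketched in a few lines. This is a minimal illustration, not the paper's objective: the abstract does not state which ranking loss is used, so the Plackett-Luce form below is an assumption, and the function names and the `beta` default are hypothetical.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def bradley_terry_loss(r_win, r_lose):
    """Binary DPO loss: -log sigmoid(r_win - r_lose), computed stably."""
    diff = r_win - r_lose
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)

def plackett_luce_loss(rewards_best_to_worst):
    """Ranking loss over a totally ordered chain (best caption first).

    Assumed form: at each position k, the k-th item must win a softmax
    over itself and every item ranked below it.
    """
    loss = 0.0
    for k, r_k in enumerate(rewards_best_to_worst):
        tail = rewards_best_to_worst[k:]
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(r - m) for r in tail))
        loss += log_norm - r_k
    return loss
```

With a chain of length two the ranking loss reduces to the Bradley-Terry loss, so the binary case is a special case of the ranked one; longer chains add one term per degradation step.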

91. 【2603.25144】FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Link: https://arxiv.org/abs/2603.25144

Authors: Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Takahiro Ogawa, Miki Haseyama, Zhihui Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: small synthetic set, large training set, training set, shown strong results, training cost

Comments:

Click to view abstract

Abstract:Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.

92. 【2603.25140】SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Link: https://arxiv.org/abs/2603.25140

Authors: Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)

Keywords: exhibit subtle visual, curated synthetic forgeries, exhibit subtle, remain challenging, detectors are trained

Comments:

Click to view abstract

Abstract:Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.

93. 【2603.25135】EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Link: https://arxiv.org/abs/2603.25135

Authors: Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Smart glass, insights under hands-busy, glass is emerging, plenty of insights, object pose estimation

Comments: Camera ready version for CVPR 2026, appendix included

Click to view abstract

Abstract: Smart glass is emerging as a useful device since it provides plenty of insights in hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions. In contrast, a tracking-based approach does show a performance gain, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at this https URL.

94. 【2603.25132】Robust Principal Component Completion

Link: https://arxiv.org/abs/2603.25132

Authors: Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: principal component analysis, Robust principal component, RPCA, component analysis, Robust principal

Comments:

Click to view abstract

Abstract:Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at this https URL.
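The mismatch the abstract describes can be written out explicitly. The notation below is an illustrative assumption, not the paper's own formulation: $\mathbf{M}$ is the observation, $\mathbf{L}$ the low-rank background, $\mathbf{S}$ the sparse component, $\mathbf{F}$ the foreground values, and $\boldsymbol{\Omega}$ a binary support mask.

```latex
% Classical RPCA: the sparse component is added to the background.
\mathbf{M} = \mathbf{L} + \mathbf{S}

% Replacement (occlusion) view: foreground entries overwrite the background
% on the support \Omega, so recovering \Omega identifies the sparse part.
\mathbf{M} = (\mathbf{1} - \boldsymbol{\Omega}) \odot \mathbf{L}
           + \boldsymbol{\Omega} \odot \mathbf{F},
\qquad \boldsymbol{\Omega} \in \{0, 1\}^{m \times n}
```

Under the second model, entries inside the support carry no information about $\mathbf{L}$. Estimating $\mathbf{L}$ then amounts to completing a low-rank array from the entries outside $\boldsymbol{\Omega}$, which is consistent with the "completion" in the method's name.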

95. 【2603.25131】Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation

Link: https://arxiv.org/abs/2603.25131

Authors: Yaowen Chang, Zhen Cao, Xu Zheng, Xiaoxin Mi, Zhen Dong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Panoramic semantic segmentation, pivotal for comprehensive, scene understanding, virtual reality, Unsupervised Domain Adaptation

Comments: Accepted to CVPR26

Click to view abstract

Abstract:Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.

96. 【2603.25129】AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.25129

Authors: Minh-Quan Viet Bui, Jaeho Moon, Munchurl Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, demonstrated remarkable zero-shot, remarkable zero-shot capabilities, Foundation Models, Vision Foundation

Comments: Project page: https://kaist-viclab.github.io/airsplat-site

Click to view abstract

Abstract:While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.

97. 【2603.25118】AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Link: https://arxiv.org/abs/2603.25118

Authors: Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained growing attention, AI-driven content creation, gained growing, growing attention, field of AI-driven

Comments: CVPR 2026 Main Conference

Click to view abstract

Abstract:Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
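A height-based reward of the kind described can be sketched as follows. The abstract says only that the reward depends on the difference between predicted and target document heights, so the relative-difference form, the clipping to [0, 1], and the heavier `overflow_weight` are all assumptions made for illustration, and `height_reward` is a hypothetical name.

```python
def height_reward(pred_height, target_height, overflow_weight=2.0):
    """Return a reward in [0, 1]; 1.0 means the heights match exactly.

    Heights are rendered page heights in pixels (assumed unit).
    """
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    rel_diff = (pred_height - target_height) / target_height
    # Assumption: overflow (page taller than the target) is penalized
    # more strongly than underflow, since overflow is the failure mode.
    penalty = overflow_weight * rel_diff if rel_diff > 0 else -rel_diff
    return max(0.0, 1.0 - penalty)
```

For example, with a 1000 px target, a 1200 px render scores lower than an 800 px render under this sketch, because the overflow side is weighted more heavily.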

98. 【2603.25109】MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Link: https://arxiv.org/abs/2603.25109

Authors: Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano, Rintaro Yanagi, Hirokatsu Kataoka, Akio Nakamura

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: key technique, technique for improving, image classification models, augmentation, classification models

Comments:

Click to view abstract

Abstract:Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
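The idea of generating an interference texture from a closed-form expression and blending it into a training image can be sketched in plain Python. The abstract does not give the paper's formula, so the two-grating product below, the parameter defaults, and the linear blend are assumptions, and `moire_pattern` and `mix` are hypothetical names.

```python
import math

def moire_pattern(height, width, f1=0.20, f2=0.22, theta1=0.0, theta2=0.05):
    """Superimpose two sinusoidal gratings; every value lies in [0, 1].

    f1, f2 are spatial frequencies in cycles per pixel; theta1, theta2
    are grating orientations in radians.
    """
    pattern = []
    for y in range(height):
        row = []
        for x in range(width):
            g1 = math.cos(2 * math.pi * f1 * (x * math.cos(theta1) + y * math.sin(theta1)))
            g2 = math.cos(2 * math.pi * f2 * (x * math.cos(theta2) + y * math.sin(theta2)))
            # The product contains a low-frequency beat term: the moire fringe.
            row.append(0.5 + 0.5 * g1 * g2)
        pattern.append(row)
    return pattern

def mix(image, pattern, alpha=0.3):
    """Linearly blend a grayscale image (values in [0, 1]) with a pattern."""
    return [[(1 - alpha) * p + alpha * m for p, m in zip(img_row, pat_row)]
            for img_row, pat_row in zip(image, pattern)]
```

Because the pattern is computed from a formula at call time, nothing needs to be stored on disk, which matches the storage-free pipeline the abstract describes. A real training loop would sample the frequencies and angles randomly per image.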

99. 【2603.25108】MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Link: https://arxiv.org/abs/2603.25108

Authors: Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, multimodal reward modeling, largely driven, shift from discriminative, multimodal

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: this https URL.

100. 【2603.25107】Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning

Link: https://arxiv.org/abs/2603.25107

Authors: Yuqiao Zeng, Xu Wang, Tengfei Liang, Yiqing Hao, Yi Jin, Hui Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale labeled data, integrates complementary information, learning integrates complementary, improve model performance, Multimodal learning integrates

Comments:

Abstract:Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.

101. 【2603.25091】Pixelis: Reasoning in Pixels, from Seeing to Acting

Link: https://arxiv.org/abs/2603.25091

Authors: Yunpeng Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: improve under shift, vision-language systems, safely improve, static observers, describe pixels

Comments: 28 pages, 16 figures, 18 tables

Abstract:Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
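The abstract defines the reported gain as (ours-baseline)/baseline. A minimal sketch of that computation is given below; the function names and the scores in the usage note are placeholders, not results from the paper.

```python
def relative_gain(ours, baseline):
    """Relative gain in percent: (ours - baseline) / baseline * 100."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (ours - baseline) / baseline * 100.0

def mean_relative_gain(pairs):
    """Average relative gain over (ours, baseline) score pairs, one pair per benchmark."""
    if not pairs:
        raise ValueError("need at least one (ours, baseline) pair")
    return sum(relative_gain(ours, base) for ours, base in pairs) / len(pairs)
```

For a hypothetical baseline score of 50.0 and a new score of 52.0, the relative gain is about 4.0%, even though the absolute difference is only 2.0 points. This is why relative and absolute gains should not be compared directly.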

102. 【2603.25089】THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Link: https://arxiv.org/abs/2603.25089

Authors: Tzu-Yen Ma, Bo Zhang, Zichen Tang, Junpeng Ding, Haolin Tian, Yuanze Li, Zhuodi Hao, Zixin Ding, Zirui Wang, Xinyu Yu, Shiyao Peng, Yizhuo Zhao, Ruomeng Jiang, Yiling Huang, Peizhi Zhao, Jiayuan Chen, Weisheng Tan, Haocheng Gao, Yang Liu, Jiacheng Liu, Zhongjun Yang, Jiayu Huang, Haihong E

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: comprehensively evaluate multimodal, evaluate multimodal large, multimodal large language, multi-task benchmark designed, large language models

Comments: Accepted to ICLR 2026

Abstract:We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

103. 【2603.25088】Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

Link: https://arxiv.org/abs/2603.25088

Authors: Chengxu Yang, Jingling Yuan, Chuang Hu, Jiawei Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Comments:

Abstract:Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computation time or GPU memory.

104. 【2603.25083】Learning domain-invariant features through channel-level sparsification for Out-of-Distribution Generalization

Link: https://arxiv.org/abs/2603.25083

Authors: Haoran Pei, Yuguang Yang, Kexin Liu, Juan Zhang, Baochang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: image analysis systems, evaluating image analysis, analysis systems, primary metric, metric for evaluating

Comments:

Abstract:Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.

105. 【2603.25077】Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

Link: https://arxiv.org/abs/2603.25077

Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Guoyin Wang, Jiancan Wu, Xiang Wang, Xiangnan He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Extending Reinforcement Learning, Extending Reinforcement, Verifiable Rewards, Reinforcement Learning, Learning with Verifiable

Comments:

Abstract:Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities -- visual grounding and symbolic reasoning -- making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
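The abstract does not say how critical tokens are identified or how their weights are set, so the sketch below only illustrates the general idea of a reweighted token-level loss. The function name, the boolean `critical_mask`, and the `boost` factor are hypothetical and are not taken from the paper.

```python
def reweighted_token_loss(token_losses, critical_mask, boost=2.0):
    """Weighted mean of per-token losses.

    Tokens flagged in `critical_mask` get weight `boost`; all others get 1.0.
    Both the mask and the boost value are assumptions made for illustration.
    """
    if len(token_losses) != len(critical_mask):
        raise ValueError("token_losses and critical_mask must have equal length")
    if not token_losses:
        raise ValueError("need at least one token")
    weights = [boost if is_critical else 1.0 for is_critical in critical_mask]
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / sum(weights)
```

Every token still contributes to the objective, which is in line with the abstract's finding that optimizing only perception tokens or only reasoning tokens underperforms full optimization.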

106. 【2603.25074】Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Link: https://arxiv.org/abs/2603.25074

Authors: Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vital safety mechanism, removing unwanted concepts, Concept erasure serves, vital safety, safety mechanism

Comments:

Abstract:Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

107. 【2603.25072】GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

Link: https://arxiv.org/abs/2603.25072

Authors: Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, Shanghang Zhang, Jian Pu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Video Large Language, Language Models, Large Language, achieved remarkable success

Comments: 11 pages, 3 figures

Abstract:Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.

108. 【2603.25058】Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos

Link: https://arxiv.org/abs/2603.25058

Authors: Xuankai Zhang, Junjin Xiao, Shangwei Huang, Wei-shi Zheng, Qing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic Gaussian Splatting, high-quality dynamic Gaussian, Gaussian Splatting, Splatting from monocular, monocular videos

Comments: Accepted to CVPR 2026

Abstract:We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods and explicitly model the continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at this https URL.
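To make the idea of a continuous trajectory driven by a few control points concrete, here is a minimal sketch of a uniform cubic B-spline. It covers only the position part. Orientation on SE(3) would need a rotation-aware spline and is omitted. The function names and the uniform-knot choice are assumptions, not details from the paper.

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1].

    Control points are equal-length tuples, for example 3D positions.
    """
    b0 = (1.0 - u) ** 3 / 6.0
    b1 = (3.0 * u**3 - 6.0 * u**2 + 4.0) / 6.0
    b2 = (-3.0 * u**3 + 3.0 * u**2 + 3.0 * u + 1.0) / 6.0
    b3 = u**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d for a, b, c, d in zip(p0, p1, p2, p3)
    )

def trajectory_position(control_points, t):
    """Position at time t in [0, n - 3] for n >= 4 control points."""
    n_segments = len(control_points) - 3
    if n_segments < 1:
        raise ValueError("need at least four control points")
    if not 0.0 <= t <= n_segments:
        raise ValueError("t is outside the spline's domain")
    i = min(int(t), n_segments - 1)  # segment index; clamp so t == n_segments is valid
    return cubic_bspline_point(*control_points[i:i + 4], t - i)
```

Each query touches only four neighboring control points, so the trajectory can be evaluated at any continuous time while the number of stored parameters stays small.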

109. 【2603.25054】Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics

Link: https://arxiv.org/abs/2603.25054

Authors: Jing Tao, Taihang Lei, Banglei Guan, Ying Qu, Xudong Na, Likun Ma, Yang Shang, Qifeng Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-time monitoring, monitoring of high-energy, Real-time, HDR, Event

Comments:

Abstract:Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event--SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.

110. 【2603.25053】GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

Link: https://arxiv.org/abs/2603.25053

Authors: Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: approach for improving, geometry-informed video generation, Gaussian splatting, video generation, present GaussFusion

Comments: CVPR 2026 main paper camera-ready. Project page: http://research.zhuliyuan.net/projects/GaussFusion/

Abstract:We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

111. 【2603.25042】MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

Link: https://arxiv.org/abs/2603.25042

Authors: Wonjoon Lee, Sungmin Woo, Donghyeong Kim, Jungho Lee, Sangheon Park, Sangyoun Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: streaming multi-view inputs, per-Gaussian motion, motion, low-latency constraints, dynamic scenes aims

Comments:

Abstract:Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

112. 【2603.25040】Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Link: https://arxiv.org/abs/2603.25040

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal foundation model, scientific multimodal foundation, multimodal foundation, foundation model, scientific multimodal

Comments:

113. 【2603.25037】GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation

Link: https://arxiv.org/abs/2603.25037

Authors: Jianbo Qi, Mengyao Li, Baogui Jiang, Yidan Chen, Qiao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)

Keywords: Satellite Earth observation, monitoring environmental change, discrete raster files, Satellite Earth, planetary-scale Earth observation

Comments: 22 pages, 7 figures

Abstract:Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity ($R^2 0.85$) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy ($R^2 0.98$). The representation compresses the 20-year MODIS archive to 0.44 GB -- approximately 95:1 relative to an optimized Int16 baseline -- with high spectral fidelity (mean $R^2 0.98$, mean RMSE $= 0.021$). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.

114. 【2603.25026】CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering

Link: https://arxiv.org/abs/2603.25026

Authors: Xu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: artifact-corrupted clinical scans, offer limited control, Medical image restoration, prior-driven enhancement, essential for improving

Comments:

Abstract:Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.

115. 【2603.25021】VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Link: https://arxiv.org/abs/2603.25021

Authors: Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing Multimodal Large, Large Language Models, Multimodal Large Language, Existing Multimodal, Language Models

Comments:

Abstract:Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel method that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

116. 【2603.25020】GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Link: https://arxiv.org/abs/2603.25020

Authors: Zhangyu Jin, Maksim Siniukov, Deuksin Kwon, Ashutosh Chaubey, Mohammad Soleymani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual human synthesis, Generating realistic, human synthesis, significant challenge, challenge in virtual

Comments:

Abstract:Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the 'Regression-to-the-Mean' problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.
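One plausible reading of "isolating reward normalization across distinct FLAME parameter groups" is that each reward group is standardized across the rollouts on its own before the groups are combined. The sketch below contrasts that with normalizing the summed reward. It is an assumption-laden illustration, not the paper's implementation, and the group names `jaw` and `expression` are hypothetical.

```python
import statistics

def zscore(values, eps=1e-8):
    """Standardize a list of rewards across rollouts (population std, eps for stability)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def decoupled_advantages(rewards_by_group):
    """Normalize each reward group separately, then sum per rollout.

    `rewards_by_group` maps a group name to a list with one reward per rollout.
    """
    normalized = [zscore(rewards) for rewards in rewards_by_group.values()]
    return [sum(per_rollout) for per_rollout in zip(*normalized)]

def coupled_advantages(rewards_by_group):
    """Baseline for comparison: sum the raw rewards first, then normalize once."""
    totals = [sum(per_rollout) for per_rollout in zip(*rewards_by_group.values())]
    return zscore(totals)
```

With hypothetical rewards `{"jaw": [0, 100, 0, 100], "expression": [1, 0, 0, 1]}`, the decoupled advantages come out at roughly `[0, 0, -2, 2]`, so both groups count equally. In the coupled version the large-scale `jaw` reward dominates and the first rollout gets a negative advantage despite its high `expression` reward.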

117. 【2603.25008】Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields

Link: https://arxiv.org/abs/2603.25008

Authors: Thanh-Hai Le, Hoang-Hau Tran, Trong-Nghia Vu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: tensor based representation, FreeNeRF frequency driven, efficient tensor based, shot regularization, paper presents

Comments: 11 pages, 8 figures

Abstract:This paper presents Few TensoRF, a 3D reconstruction framework that combines TensoRF's efficient tensor-based representation with FreeNeRF's frequency-driven few-shot regularization. By using TensoRF to significantly accelerate rendering and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthetic NeRF benchmark show that Few TensoRF improves the average PSNR from 21.45 dB (TensoRF) to 23.70 dB, with the fine-tuned version reaching 24.52 dB, while maintaining TensoRF's fast training time of roughly 10-15 minutes. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37-34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data-effective solution for real-time 3D reconstruction across diverse scenes.
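To put the PSNR figures in perspective, the standard definition PSNR = 10 log10(MAX^2 / MSE) can be inverted to see how a dB gain maps to a reduction in mean squared error. The helper names below are illustrative. Because benchmark PSNR is usually averaged over scenes, the resulting ratio is only indicative.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value**2 / mse)

def mse_ratio(psnr_before, psnr_after):
    """Factor by which MSE changes for a PSNR change, assuming the same peak value."""
    return 10.0 ** (-(psnr_after - psnr_before) / 10.0)
```

Going from 21.45 dB to 23.70 dB is a gain of 2.25 dB, so `mse_ratio(21.45, 23.70)` is about 0.6, meaning the error drops by roughly 40% under this simplification.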

118. 【2603.25006】Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning

Link: https://arxiv.org/abs/2603.25006

Authors: Md. Rokon Mia, Rakib Hossain Sajib, Abdullah Al Noman, Abir Ahmed, B M Taslimul Haque

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: staple crop supporting, Early detection, rice leaf diseases, world population, supporting a substantial

Abstract:Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world's population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross-entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied to three state-of-the-art backbone architectures, InceptionNetV3, DenseNet201, and EfficientNetB0, each trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2%, and 99.2%, respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
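
To make the idea of combining an angular-margin term with a center term more concrete, here is a minimal single-sample sketch in plain Python. It is not the authors' implementation: the scale, margin, and weighting factor `lam` are assumed values, the function names are hypothetical, and the usual ArcFace safeguard for angles that exceed pi after adding the margin is omitted for brevity.

```python
import math


def _unit(v):
    """Return v scaled to unit length (v is assumed to be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Scaled cosine logits with an additive angular margin on the target class."""
    e = _unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, _unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        if k == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def center_loss(embedding, center):
    """Half the squared distance between an embedding and its class center."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(embedding, center))


def dual_loss(embedding, class_weights, centers, target, lam=0.01):
    """ArcFace cross-entropy plus a weighted center term (lam is an assumption)."""
    ce = cross_entropy(arcface_logits(embedding, class_weights, target), target)
    return ce + lam * center_loss(embedding, centers[target])
```

The angular margin lowers the target-class logit, so the network has to separate classes by a wider angle, while the center term pulls embeddings of the same class together; how the two terms are actually weighted in the paper is not stated in the abstract.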

119. 【2603.25004】Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Link: https://arxiv.org/abs/2603.25004

Authors: Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: referring expression comprehension, task-specific training data, Zero-shot referring expression, Large Language Models, natural language queries

Comments: Accepted by T-MM

Abstract:Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.

120. 【2603.25000】Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

Link: https://arxiv.org/abs/2603.25000

Authors: WenXi Wang, JunQi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reducing property loss, Rapid transit, ensure rapid transit, transit of emergency, surrounding ordinary vehicles

Comments: Submitted to IEEE Transactions on Cybernetics

Abstract:Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtain optimal solutions but are only practical for small-scale scenarios. The latter learns implicitly through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome these limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors online in a distributed manner using only local instead of global information. We prove that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. A distributed conflict resolution mechanism is then proposed to guarantee vehicles' safety by avoiding decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Simulation experiments based on real-world traffic datasets demonstrate that, compared with existing methods, the proposed method achieves faster decision-making, has less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

121. 【2603.24994】Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting

Link: https://arxiv.org/abs/2603.24994

Authors: Junoh Leea, Junmyeong Lee, Yeon-Ji Song, Inhwan Bae, Jisu Shin, Hae-Gon Jeon, Jin-Hwa Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown significant promise, Splatting has shown, Gaussian Splatting, significant promise, shown significant

Comments: 24 pages, 7 figures

Abstract:The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $\alpha$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.

122. 【2603.24992】C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance

Link: https://arxiv.org/abs/2603.24992

Authors: Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: late gadolinium-enhanced MRI, remains challenging due, gadolinium-enhanced MRI, wall thickness mapping, Accurate segmentation

Comments: Submitted to the International Conference on Artificial Intelligence in Medicine (AIME 2026)

Abstract:Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall's thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
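
The Dice scores quoted above follow the usual overlap definition, Dice = 2|A ∩ B| / (|A| + |B|). The sketch below is a generic illustration rather than the paper's evaluation code; the toy one-dimensional masks are invented to show why a thin structure is penalized heavily by a one-voxel shift.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 1D profiles: the prediction is the ground truth shifted by one voxel.
thin_true = [0, 1, 1, 0, 0, 0, 0, 0, 0]   # 2-voxel "wall"
thin_pred = [0, 0, 1, 1, 0, 0, 0, 0, 0]
thick_true = [0, 1, 1, 1, 1, 1, 1, 0, 0]  # 6-voxel "cavity"
thick_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0]

print(dice_score(thin_true, thin_pred))
print(dice_score(thick_true, thick_pred))
```

The same one-voxel error costs the thin structure far more overlap than the thick one, which is one reason wall segmentation tends to report lower Dice than cavity segmentation.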

123. 【2603.24991】Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets

Link: https://arxiv.org/abs/2603.24991

Authors: Peng Wu, Yuting Yan, Guansong Pang, Yujia Sun, Qingsen Yan, Peng Wang, Yanning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video anomaly detection, anomaly detection, video anomaly, inherent privacy-preserving properties, characterized by low

Abstract:Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.

124. 【2603.24985】Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Link: https://arxiv.org/abs/2603.24985

Authors: Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance images, left atrial wall, wall thin geometry, late gadolinium enhancement, gadolinium enhancement magnetic

Comments: Submitted to IEEE EMBC 2026

Abstract:Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall's thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.

125. 【2603.24984】MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

Link: https://arxiv.org/abs/2603.24984

Authors: Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high model capacity, preserving high model, overhead of Transformer, Transformer architectures, effective approach

Comments: Accepted at CVPR 2026

Abstract:Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook better expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
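
For readers unfamiliar with GRPO, its core signal is a group-relative advantage: several rollouts are sampled for the same input, and each reward is normalized by the mean and standard deviation of its group. The sketch below shows only that generic normalization step. Treating each rollout as one sampled routing choice is an assumption made for illustration, the function name is hypothetical, and using the population standard deviation is a simplifying choice; the abstract does not describe these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same input.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that score above the group average receive a positive advantage and are reinforced, while those below it are discouraged, so no separate value network is needed.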

126. 【2603.24969】PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

Link: https://arxiv.org/abs/2603.24969

Authors: Yilin Ni, Wenjie Li, Zhengxue Wang, Juncheng Li, Guangwei Gao, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low light suffer, light suffer multiple, suffer multiple degradations-low, real-world low light, multiple degradations-low illumination

Abstract:Face images captured in real-world low light suffer from multiple degradations, such as low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion method. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.

127. 【2603.24965】Self-Corrected Image Generation with Explainable Latent Rewards

Link: https://arxiv.org/abs/2603.24965

Authors: Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: prompts remains challenging, complex prompts remains, remains challenging, spatial relations, significant progress

Comments: CVPR 2026

Abstract:Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at this https URL.

128. 【2603.24961】Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Link: https://arxiv.org/abs/2603.24961

Authors: Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: presents unique challenges, unique challenges due, Assessing student handwritten, personalized educational feedback, varied problem-solving approaches

Comments: Accepted by the 27th International Conference on Artificial Intelligence in Education (AIED'26)

129. 【2603.24953】Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

Link: https://arxiv.org/abs/2603.24953

Authors: ZeBin Ji, Yang Hu, Xiuli Bi, Bo Liu, Bin Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural network, neural network decision-making, neural network decisions, understanding neural network, neuron

Comments: Accepted in CVPR 2026

Abstract:It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network's decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron's well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.

130. 【2603.24942】BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

Link: https://arxiv.org/abs/2603.24942

Authors: Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong capabilities, progressively removing noise, Recent diffusion, flow matching, demonstrated strong

Comments: Accepted in CVPR 2026

Abstract:Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both the "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.

131. 【2603.24941】Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.24941

Authors: Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: significant inference latency, inference latency due, processing dense visual, dense visual tokens, models excel

Comments: 10 pages, 7 figures, preprint

132. 【2603.24938】Infinite Gaze Generation for Videos with Autoregressive Diffusion

链接：https://arxiv.org/abs/2603.24938

作者：Jenna Kang,Colin Groth,Tong Wu,Finley Torrens,Patsorn Sangkloy,Gordon Wetzstein,Qi Sun

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Predicting human gaze, advancing scene understanding, Predicting human, multimodal interaction, fundamental to advancing

备注：

点击查看摘要

Abstract:Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

133. 【2603.24936】TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

Link: https://arxiv.org/abs/2603.24936

Authors: Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: visually complex environments, Human trajectory forecasting, intelligent multimedia systems, multimedia systems operating, Human trajectory

Abstract:Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.

134. 【2603.24934】CVA: Context-aware Video-text Alignment for Video Temporal Grounding

Link: https://arxiv.org/abs/2603.24934

Authors: Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Context-aware Video-text Alignment, sensitive video-text alignment, achieving temporally sensitive, propose Context-aware Video-text, temporally sensitive video-text

Comments: Accepted to CVPR 2026

Abstract:We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.

135. 【2603.24912】ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects

Link: https://arxiv.org/abs/2603.24912

Authors: Jing Yang, Krithika Dharanikota, Emily Jia, Haiwei Chen, Yajie Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reflect light remains, Accurately modeling, measured reflectance data, real measured reflectance, materials reflect light

Comments: CVPR 2026

Abstract:Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: this https URL

136. 【2603.24903】Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data

Link: https://arxiv.org/abs/2603.24903

Authors: Haresh Rengaraj Rajamohan, Yuxuan Chen, Kyunghyun Cho, Cem M. Deniz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: SSL, improves knee osteoarthritis, SSL pretrained, study assesses, assesses whether self-supervised

Abstract:This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.

137. 【2603.24897】SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform

Link: https://arxiv.org/abs/2603.24897

Authors: Yan Meng,Jack Cook,X.Y. Han,Kaan Duman,Shauna Otto,Dhiraj Pangal,Jonathan Chainey,Ruth Lau,Margaux Masson-Forsythe,Daniel A. Donoho,Danielle Levy,Gabriel Zada,Sébastien Froelich,Juan Fernandez-Miranda,Mike Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: supporting intraoperative decision-making, Accurate surgical phase, analyzing procedural workflows, Accurate surgical, supporting intraoperative

Comments:

Abstract:Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
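
The abstract names focal loss as one ingredient of the fine-tuning regime used against class imbalance. As a minimal sketch of the standard focal loss formula, and not the authors' implementation, the snippet below shows how well-classified samples are down-weighted. The function name and the values of `gamma` and `alpha` are assumptions, since the abstract gives no hyperparameters.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma=0 gives cross-entropy)
    alpha  : scalar weight (assumed value)
    """
    # Clamp the true-class probability to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct frame contributes far less than a misclassified one.
easy = focal_loss([0.05, 0.90, 0.05], target=1)
hard = focal_loss([0.60, 0.10, 0.30], target=1)
print(f"easy sample: {easy:.4f}, hard sample: {hard:.4f}")
```

With `gamma=0` the expression reduces to plain cross-entropy, which makes it easy to compare the two losses on the same predictions.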

138. 【2603.24876】OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Link: https://arxiv.org/abs/2603.24876

Authors: Xiaoyu Tang,Jun Dong,Jintao Cheng,Rui Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing visual, remote sensing images, sensing visual grounding, natural language expressions, Remote sensing

Comments:

Abstract:Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.

139. 【2603.24866】How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

Link: https://arxiv.org/abs/2603.24866

Authors: Luyu Yang,Yutong Dai,An Yan,Viraj Prabhu,Ran Xu,Zeyuan Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: governed by rigorous, procedural constraints, physical world, physical, physical generative reasoning

Comments:

140. 【2603.24857】AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Link: https://arxiv.org/abs/2603.24857

Authors: Zhenyi Wang,Siyu Luan

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: machine learning, systems expand, scale and functionality, increasingly complex

Comments: Published at Transactions on Machine Learning Research (TMLR)

141. 【2603.24850】Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration

Link: https://arxiv.org/abs/2603.24850

Authors: Lukas Kratochvila,Jakub Stefansky,Simon Bilik,Robert Rous,Tomas Zemcik,Michal Wolny,Frantisek Rusnak,Ondrej Cech,Karel Horak

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Fire safety consists, complex pipeline, topic of concern, Fire safety, safety consists

Comments:

Abstract:Fire safety involves a complex pipeline and is an important topic of concern. One of its front-line components is the smoke detector, which is supposed to raise an alarm before a massive fire develops. As detectors are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial: it could allow faster inspections, keep workers away from dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of such an automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolution-based object detectors widely used on embedded devices, YOLOv11 and SSD, with the state-of-the-art transformer-based RT-DETRv2, using backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. For robust testing, all models were evaluated on two test datasets, one with the expected and one with a difficult appearance of the smoke detectors, including motion blur, low resolution, or incomplete objects. The best-performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.
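
The reported mAP@0.5 score rests on a simple matching rule: a predicted box counts as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates only that criterion, not the full average-precision computation or the authors' evaluation code. The `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection matches a ground-truth box if IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical detector output versus an annotated smoke detector.
prediction = (0, 0, 10, 10)
annotation = (0, 0, 10, 8)
print(iou(prediction, annotation), is_true_positive(prediction, annotation))
```

Average precision is then computed from the precision-recall curve built on these matches, and mAP averages it over classes.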

142. 【2603.24849】Gaze patterns predict preference and confidence in pairwise AI image evaluation

Link: https://arxiv.org/abs/2603.24849

Authors: Nikolas Papadopoulos,Shreenithi Navaneethan,Sheng Bai,Ankur Samanta,Paul Sajda

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: Direct Preference Optimization, Preference learning methods, Human Feedback, pairwise human judgments, Reinforcement Learning

Comments: This paper has been accepted to ACM ETRA 2026

Abstract:Preference learning methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on pairwise human judgments, yet little is known about the cognitive processes underlying these judgments. We investigate whether eye-tracking can reveal preference formation during pairwise AI-generated image evaluation. Thirty participants completed 1,800 trials while their gaze was recorded. We replicated the gaze cascade effect, with gaze shifting toward chosen images approximately one second before the decision. Cascade dynamics were consistent across confidence levels. Gaze features predicted binary choice (68% accuracy), with chosen images receiving more dwell time, fixations, and revisits. Gaze transitions distinguished high-confidence from uncertain decisions (66% accuracy), with low-confidence trials showing more image switches per second. These results show that gaze patterns predict both choice and confidence in pairwise image evaluations, suggesting that eye-tracking provides implicit signals relevant to the quality of preference annotations.

143. 【2603.24847】CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment

Link: https://arxiv.org/abs/2603.24847

Authors: Jinkui Hao,Gorkem Durak,Halil Ertugrul Aktas,Ulas Bagci,Bradley D. Allen,Nilay S. Shah,Bo Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computed tomography angiography, coronary computed tomography, cardiovascular mortality worldwide, mortality worldwide, tomography angiography

Comments:

Abstract:Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain. Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.

144. 【2603.24846】NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Link: https://arxiv.org/abs/2603.24846

Authors: Katarina Trojachanec Dineva,Stefan Andonov,Ilinka Ivanoska,Ivan Kitanovski,Sasho Gramatikov,Tamara Kostova,Monika Simjanoska Misheva,Kostadin Mishev

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: image-based decision support, Recent advances, large language models, decision support, language models enable

Comments: 53 pages, 12 figures. Manuscript submitted to the BMC Medical Informatics and Decision Making journal

Abstract:Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.

145. 【2603.24836】WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching

Link: https://arxiv.org/abs/2603.24836

Authors: Yihan Wang,Jia Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: effective warping-based method, stereo matching, simple and effective, effective warping-based, KITTI and Middlebury

Comments:

Abstract:We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D, KITTI and Middlebury public benchmarks, reducing the zero-shot error by 81% on ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at this https URL.

146. 【2603.24835】DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation

Link: https://arxiv.org/abs/2603.24835

Authors: Junyi Ouyang,Wenbin Teng,Gonglin Chen,Yajie Zhao,Haiwei Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: world modeling primarily, modeling primarily due, Long-trajectory video generation, existing video diffusion, video diffusion models

Comments: 29 pages, 11 figures. Project page: [this https URL](https://junyiouy.github.io/projects/dcarl)

Abstract:Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.

147. 【2603.24821】Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Link: https://arxiv.org/abs/2603.24821

Authors: Alabi Mehzabin Anisha,Guangjing Wang,Sriram Chellappan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: counting and localization, localization are primarily, primarily modeled, point regression, density map

Comments: Accepted at CVPR 2026 Main Conference

Abstract:State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at this https URL

148. 【2603.24815】Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators

Link: https://arxiv.org/abs/2603.24815

Authors: Yubo Wang,Marie Fridberg,Anirejuoritse Bafor,Ole Rahbek,Christopher Iobst,Søren Vedding Kold,Ming Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pin sites represent, Pin sites, pin site infections, pin site, external environment passes

Comments:

Abstract:Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. For this, this paper collects and produces a dataset on pin sites wound infections and proposes a deep learning (DL) method to classify pin sites images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters. These results highlight the potential of DL in differentiating pin sites only based on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.

149. 【2603.24804】GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Link: https://arxiv.org/abs/2603.24804

Authors: Deen Dayal Mohan,Hossein Souri,Vitali Petsiuk,Juhong Min,Gopal Sharma,Luowei Zhou,Suren Kumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large-scale vision-language models, billion-sample datasets, posing a significant, barrier to progress, success of large-scale

Comments:

Abstract:Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder integrated decoder with Visual Question Answering (VQA) objective that enables the encoder to generalize beyond the caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: this https URL.
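
The abstract mentions an uncertainty-based weighting mechanism for balancing heterogeneous losses but does not give its form. One common formulation, shown here purely as an assumption, learns a log-variance `s_i` per loss and minimizes the sum of `exp(-s_i) * L_i + s_i`. The function name and the example loss values below are hypothetical and are not taken from GoldiCLIP.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses with learned log-variances (assumed formulation).

    Each term is exp(-s_i) * L_i + s_i: a large s_i down-weights a noisy
    loss, while the additive s_i penalizes ignoring the loss entirely.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical values for three objectives on very different scales.
losses = [4.0, 0.5, 1.0]
equal_weights = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
# For a fixed loss L, the term is minimized at s = log(L).
tuned_weights = uncertainty_weighted_loss(losses, [math.log(l) for l in losses])
print(equal_weights, tuned_weights)
```

In training, the `log_vars` would be trainable parameters updated by the same optimizer as the model weights, so the balance shifts automatically as individual losses change scale.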

150. 【2603.24801】Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

Link: https://arxiv.org/abs/2603.24801

Authors: Abu Noman Md Sakib,Merjulah Roby,Zijie Zhang,Satish Muluk,Mark K. Eskandari,Ender A. Finol

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Computed tomography image, abdominal aortic aneurysms, models assign internal, Computed tomography, assign internal focus

Comments:

Abstract:Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

151. 【2603.24800】Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

Link: https://arxiv.org/abs/2603.24800

Authors: Danil Tokhchukov,Aysel Mirzoeva,Andrey Kuznetsov,Konstantin Sobolev

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, potential of Diffusion, enhance generative tasks, uncover the hidden, hidden potential

Comments: Accepted to CVPR 2026, Project page: [this https URL](https://v-gen-ai.github.io/Calibri-page/)

Abstract:In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
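
The abstract frames calibration as black-box reward optimization over roughly 100 scaling parameters, solved with an evolutionary algorithm, but does not say which one. The sketch below uses a simple elitist evolution strategy as a stand-in. The function names, the population settings, and the toy reward are all hypothetical; a real reward would score generated images.

```python
import random


def evolve_scales(reward_fn, n_params=100, generations=200,
                  population=8, sigma=0.05, seed=0):
    """Elitist evolution strategy over per-block scaling parameters.

    Starts from all-ones (the uncalibrated model) and keeps a candidate
    only when the black-box reward improves. No gradients are needed.
    """
    rng = random.Random(seed)
    best = [1.0] * n_params
    best_reward = reward_fn(best)
    for _ in range(generations):
        for _ in range(population):
            # Mutate the current best with Gaussian noise.
            candidate = [s + rng.gauss(0.0, sigma) for s in best]
            reward = reward_fn(candidate)
            if reward > best_reward:
                best, best_reward = candidate, reward
    return best, best_reward


def toy_reward(scales):
    """Hypothetical stand-in reward, peaked when every scale equals 1.2."""
    return -sum((s - 1.2) ** 2 for s in scales)


if __name__ == "__main__":
    scales, reward = evolve_scales(toy_reward)
    print(f"start reward: {toy_reward([1.0] * 100):.3f}, final reward: {reward:.3f}")
```

Because only reward evaluations are required, the same loop works with non-differentiable rewards such as human-preference scores.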

152. 【2603.24793】AVControl: Efficient Framework for Training Audio-Visual Controls

Link: https://arxiv.org/abs/2603.24793

Authors: Matan Ben-Yosef,Tavi Halperin,Naomi Ken Korem,Mohammad Salama,Harel Cain,Asaf Joseph,Anthony Chen,Urska Jelercic,Ofir Bibi

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

Keywords: introduce costly architectural, audio transformations, trajectories and audio, Controlling video, single monolithic model

Comments: Project page: [this https URL](https://matanby.github.io/AVControl/)

Abstract:Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.

153. 【2603.24770】DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects

Link: https://arxiv.org/abs/2603.24770

Authors: Narek Tumanyan,Samuel Rota Bulò,Denis Rozumny,Lorenzo Porzi,Adam Harley,Tali Dekel,Peter Kontschieder,Jonathon Luiten

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent remarkable progress, Dynamic scene reconstruction, remarkable progress, reconstruction from casual, casual videos

Comments: Project page: [this https URL](https://drops-dynamics.github.io/)

Abstract:Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.

154. 【2603.24764】Synthetic Cardiac MRI Image Generation using Deep Generative Models

Link: https://arxiv.org/abs/2603.24764

Authors: Ishan Kumarasinghe,Dasuni Kawya,Madhura Edirisooriya,Isuri Devindi,Isuru Nawinne,Vajira Thambawita

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Synthetic cardiac MRI, annotated medical imaging, cardiac MRI, medical imaging data, promising strategy

Comments: 12 pages, 2 figures, Preprint

Abstract:Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

155. 【2603.24753】Light Cones For Vision: Simple Causal Priors For Visual Hierarchy

Link: https://arxiv.org/abs/2603.24753

Authors: Manglam Kartik,Neel Tushar Shah

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Standard vision models, Standard vision, vision models treat, Worldline Slot Attention, unable to capture

Comments: ICLR GRaM Workshop 2026

Abstract:Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime worldlines, where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings showing visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: this https URL.
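
The contrast between Euclidean and Lorentzian geometry can be illustrated with the Minkowski interval. The sketch below is a generic illustration of light-cone membership, not the paper's architecture. The `(t, x, y)` slot layout, the function names, and the convention that coarser levels sit at earlier times are assumptions.

```python
def interval_sq(a, b):
    """Squared Minkowski interval between events a and b, signature (-, +, +).

    Events are (t, x, y) tuples. Negative values are timelike, positive
    values are spacelike, and zero lies on the light cone.
    """
    dt = b[0] - a[0]
    dx_sq = sum((q - p) ** 2 for p, q in zip(a[1:], b[1:]))
    return -dt * dt + dx_sq


def in_future_cone(a, b):
    """True if b lies in (or on) the future light cone of a."""
    return b[0] > a[0] and interval_sq(a, b) <= 0.0


# Slots on one worldline share a spatial position and differ only in time,
# so they are always causally ordered, and the ordering is asymmetric.
coarse_slot = (0.0, 1.0, 1.0)   # hypothetical "whole" level
fine_slot = (1.0, 1.0, 1.0)     # hypothetical "part" level
unrelated = (1.0, 5.0, 5.0)     # different object, far away in space

print(in_future_cone(coarse_slot, fine_slot))   # True
print(in_future_cone(fine_slot, coarse_slot))   # False
print(in_future_cone(coarse_slot, unrelated))   # False
```

A Euclidean distance between the same points is symmetric, so it cannot by itself encode which slot is the parent and which is the child.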

156. 【2603.24749】TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

Link: https://arxiv.org/abs/2603.24749

Authors: David G. Shatwell, Sirnam Swetha, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: environmental analysis require, analysis require jointly, require jointly reasoning, urban monitoring, digital forensics

Comments: Accepted in CVPR 2026

Abstract:Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

157. 【2603.24733】OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video

Link: https://arxiv.org/abs/2603.24733

Authors: Selim Gilon, Emily Y. Miller, Scott D. Uhlrich

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)

Keywords: Quantifying human movement, OpenCap Monocular, transform prediction, mobility-related conditions, monitoring of mobility-related

Comments:

Abstract:Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (this https URL), enabling free, accessible single-smartphone biomechanical assessments.

158. 【2603.24730】A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception

Link: https://arxiv.org/abs/2603.24730

Authors: Yuqi Hu, Vasha DuTell, Ahna R. Girshick, Jennifer E. Corbett

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: classic duck-rabbit illusion, duck-rabbit illusion reveals, classic duck-rabbit, duck-rabbit illusion, illusion reveals

Comments:

Abstract:The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
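
The abstract does not say which interpolation scheme is used in the CLIP embedding space. As a rough sketch of how a continuous spectrum between two concept embeddings could be produced, the snippet below uses spherical linear interpolation (slerp), a common choice for normalized embeddings. The 3-dimensional vectors stand in for real CLIP embeddings, and every name here is hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def slerp(a, b, t):
    """Spherical linear interpolation between the directions of a and b, t in [0, 1]."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)          # angle between the two unit vectors
    s = math.sin(omega)
    if s < 1e-8:
        if dot > 0:
            return list(a)          # the vectors (nearly) coincide
        raise ValueError("antipodal vectors: the interpolation path is not unique")
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Toy stand-ins for the two concept embeddings (assumed, not real CLIP outputs).
duck = [1.0, 0.0, 0.0]
rabbit = [0.0, 1.0, 0.0]

# Five evenly spaced points from "duck" (t = 0) to "rabbit" (t = 1).
spectrum = [slerp(duck, rabbit, i / 4) for i in range(5)]
```

Each point in `spectrum` stays on the unit sphere, so intermediate embeddings keep the same norm as the endpoints; in a generation pipeline each of them would condition one synthesized image.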

159. 【2603.24725】Confidence-Based Mesh Extraction from 3D Gaussians

Link: https://arxiv.org/abs/2603.24725

Authors: Lukas Radl, Felix Windisch, Andreas Kurz, Thomas Köhler, Michael Steiner, Markus Steinberger

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Gaussian Splatting, fast software rasterization, posed images due, greatly accelerated mesh, accelerated mesh extraction

Comments: Project Page: [this https URL](https://r4dl.github.io/CoMe/)

Abstract:Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.

160. 【2603.24724】Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

Link: https://arxiv.org/abs/2603.24724

Authors: Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Convolutional Neural Networks, deep Convolutional Neural, Appearance-based gaze estimation, deep Convolutional, estimation frequently relies

Comments:

Abstract:Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at this https URL.

161. 【2603.24721】Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models

Link: https://arxiv.org/abs/2603.24721

Authors: Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: intelligent embodied agents, developing intelligent embodied, locating target objects, target objects based, Spatial reasoning focuses

Comments: Accepted by CVPR 2026

Abstract:Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method whose input length is linear in the number of objects and which explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at this https URL.

162. 【2603.24716】Accurate Point Measurement in 3DGS -- A New Alternative to Traditional Stereoscopic-View Based Measurements

Link: https://arxiv.org/abs/2603.24716

Authors: Deyan Deng, Rongjun Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, revolutionized real-time rendering, measurement remains underutilized, geometric measurement remains, remains underutilized

Comments: Accepted to the 2026 ISPRS Congress

Abstract:3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points. More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: this https URL.
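
The triangulation step mentioned in the abstract can be sketched as a least-squares intersection of viewing rays, which extends naturally from two views to many. The paper's exact formulation is not given in the abstract, so the solver below is an assumed, generic one: each picked point is represented by a hypothetical camera center and a ray direction, and the result is the 3D point minimizing the summed squared distance to all rays.

```python
import math

def _unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; the point is not determined")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def triangulate(origins, directions):
    """Least-squares intersection of rays o_i + s * d_i.

    Solves (sum_i P_i) x = sum_i P_i o_i with P_i = I - d_i d_i^T,
    where P_i projects onto the plane orthogonal to ray i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = _unit(d)
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p
                b[i] += p * o[j]
    return _solve3(A, b)

# Three hypothetical views whose rays all pass through the same target point.
target = [1.0, 2.0, 10.0]
origins = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 1.0]]
directions = [[t - o for t, o in zip(target, org)] for org in origins]
point = triangulate(origins, directions)
```

With noisy picks the rays no longer meet exactly, and adding more views to `origins` and `directions` averages the error, which is the intuition behind the multi-view intersection the abstract describes.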

163. 【2603.24713】Lookalike3D: Seeing Double in 3D

Link: https://arxiv.org/abs/2603.24713

Authors: Chandan Yeshwanth, Angela Dai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce impressive results, generation methods produce, methods produce impressive, impressive results, understanding and generation

Comments: Project page: [this https URL](https://cy94.github.io/lookalike3d/), Video: [this https URL](https://www.youtube.com/watch?v=g6S7J0y_52U)

Abstract:3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.

164. 【2603.24696】LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Link: https://arxiv.org/abs/2603.24696

Authors: Gokce Inal, Pouyan Navard, Alper Yilmaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, science remains largely, planetary science remains, enabled joint reasoning, textual information

Comments: Accepted in AI4Space Workshop CVPR2026. Website: [this https URL](https://osupcvlab.github.io/LLaVA-LE/), Dataset: [this https URL](https://huggingface.co/datasets/pcvlab/lucid)

Abstract:Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at this https URL.

165. 【2603.24695】Amplified Patch-Level Differential Privacy for Free via Random Cropping

Link: https://arxiv.org/abs/2603.24695

Authors: Kaan Durmaz, Jan Schuchardt, Sebastian Schmidt, Stephan Günnemann

Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: private machine learning, machine learning models, Random cropping, common data augmentation, data augmentation techniques

Comments: Published at TMLR

Abstract:Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model's input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
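
To make the idea of a patch inclusion probability concrete, here is a small counting sketch. It is not the paper's privacy analysis: it assumes a square crop placed uniformly at integer offsets, treats a patch as "included" when the crop overlaps it by at least one pixel, and combines the result with the minibatch sampling rate by a plain product. All three choices are illustrative assumptions, and the function names are hypothetical.

```python
def overlap_prob_1d(n, crop, lo, hi):
    """Probability that a window of length `crop`, placed uniformly at integer
    offsets 0..n-crop, overlaps the half-open interval [lo, hi)."""
    if not (0 < crop <= n and 0 <= lo < hi <= n):
        raise ValueError("invalid sizes")
    first = max(0, lo - crop + 1)     # smallest offset that still reaches lo
    last = min(n - crop, hi - 1)      # largest offset that starts before hi
    count = max(0, last - first + 1)
    return count / (n - crop + 1)

def patch_inclusion_prob(height, width, crop, rows, cols):
    """Probability that a random crop x crop window touches the patch.

    `rows` and `cols` are half-open (start, stop) ranges of the sensitive patch.
    Row and column offsets are drawn independently, so the probabilities multiply.
    """
    return (overlap_prob_1d(height, crop, *rows)
            * overlap_prob_1d(width, crop, *cols))

def effective_rate(q, p_include):
    """Assumed composition: minibatch sampling rate times inclusion probability."""
    return q * p_include

# A 10x10 patch in the corner of a 100x100 image, cropped to 50x50.
p = patch_inclusion_prob(100, 100, 50, rows=(0, 10), cols=(0, 10))
rate = effective_rate(0.01, p)
```

Under these assumptions a corner patch is seen far less often than the image itself is sampled, which is the intuition behind the lower effective sampling rate; the actual guarantees come from the bounds derived in the paper.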

166. 【2603.24691】BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2603.24691

Authors: Bentao Song, Jun Huang, Qingfeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image segmentation, semi-supervised medical image, achieving superior performance, domain semi-supervised medical, achieving superior

Comments: Accepted at Neural Networks

Abstract:In mixed domain semi-supervised medical image segmentation (MiDSS), achieving superior performance under domain shift and limited annotations is challenging. This scenario presents two primary issues: (1) distributional differences between labeled and unlabeled data hinder effective knowledge transfer, and (2) inefficient learning from unlabeled data causes severe confirmation bias. In this paper, we propose the bidirectional correlation maps domain adaptation (BCMDA) framework to overcome these issues. On the one hand, we employ knowledge transfer via virtual domain bridging (KTVDB) to facilitate cross-domain learning. First, to construct a distribution-aligned virtual domain, we leverage bidirectional correlation maps between labeled and unlabeled data to synthesize both labeled and unlabeled images, which are then mixed with the original images to generate virtual images using two strategies, a fixed ratio and a progressive dynamic MixUp. Next, dual bidirectional CutMix is used to enable initial knowledge transfer within the fixed virtual domain and gradual knowledge transfer from the dynamically transitioning labeled domain to the real unlabeled domains. On the other hand, to alleviate confirmation bias, we adopt prototypical alignment and pseudo label correction (PAPLC), which utilizes learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, yielding smoother and more compact feature representations. Finally, we use prototypical pseudo label correction to generate more reliable pseudo labels. Empirical evaluations on three public multi-domain datasets demonstrate the superiority of our method, particularly showing excellent performance even with very limited labeled samples. Code available at this https URL.

167. 【2603.24690】UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Link: https://arxiv.org/abs/2603.24690

Authors: Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi, Xiaobin Hu, Yong Liu, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning enables training-free, In-context Learning enables, In-context Learning, remains highly sensitive, enables training-free adaptation

Comments: ECCV2026 under review

Abstract:In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at this https URL.

168. 【2603.24684】KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

Link: https://arxiv.org/abs/2603.24684

Authors: Quanyun Wu, Kyle Gao, Daniel Long, David A. Clausi, Jonathan Li, Yuhao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied AI training, evaluation require object-centric, semantic grounding, global point clouds, training and evaluation

Comments:

Abstract:Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.

169. 【2603.24680】ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

Link: https://arxiv.org/abs/2603.24680

Authors: An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent multimodal large, multimodal large language, large language models, Recent multimodal, expensive because Transformers

Comments:

Abstract:Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at this https URL.
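
One way to picture a selection rule that combines text-conditioned relevance with max-min diversity is the greedy sketch below. It is not the paper's scoring rule, which the abstract does not spell out: here relevance is assumed to be cosine similarity to a text embedding, diversity is assumed to be the minimum cosine distance to the tokens already kept, and the two are mixed with a hypothetical weight `alpha`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def select_tokens(tokens, text, k, alpha=0.5):
    """Greedily keep k token indices, trading relevance against redundancy."""
    if k <= 0 or not tokens:
        return []
    relevance = [cosine(t, text) for t in tokens]
    # Start from the single most text-relevant token.
    selected = [max(range(len(tokens)), key=lambda i: relevance[i])]
    while len(selected) < min(k, len(tokens)):
        best, best_score = None, -math.inf
        for i in range(len(tokens)):
            if i in selected:
                continue
            # Max-min diversity: distance to the closest token already kept.
            diversity = min(1.0 - cosine(tokens[i], tokens[j]) for j in selected)
            score = alpha * relevance[i] + (1.0 - alpha) * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)

# Toy 2-D "visual tokens"; token 1 is a near-duplicate of token 0.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.2]
kept = select_tokens(tokens, text, k=2)
```

In this toy case the second pick skips the near-duplicate of the first token even though it is more relevant, because the diversity term penalizes redundancy. Applying such a rule to vision encoder outputs before the projector is the placement the abstract describes.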

170. 【2603.24653】From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

Link: https://arxiv.org/abs/2603.24653

Authors: Francesco Gentile, Nicola Dall'Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero, Elisa Ricci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deployed at scale, understanding their internal, increasingly critical, internal mechanisms, mechanisms becomes increasingly

Comments: Accepted @ CVPR 2026. Project page: [this https URL](https://frangente.github.io/SITH/)

Abstract:As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.

171. 【2603.24649】MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

Link: https://arxiv.org/abs/2603.24649

Authors: Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Chengzhi Shen, Min Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: evaluating vision-language models, demand significant manual, significant manual labor, oversimplifies clinical reality, evaluating vision-language

Comments: 11 pages, 2 figures

Abstract:Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.

172. 【2603.25645】Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Link: https://arxiv.org/abs/2603.25645

Authors: Abdullah Hamdi, Changchun Yang, Xin Gao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: colon cancer prevention, Large Language Models, Multimodal Large Language, Early screening, modern Multimodal Large

Comments: preprint

Abstract:Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at this https URL.