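This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

487 papers were updated today, including:

Natural language processing: 55
Information retrieval: 6
Computer vision: 138

Natural Language Processing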

1. 【2603.26664】Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

Link: https://arxiv.org/abs/2603.26664

Authors: Mo Li, L.H. Xu, Qitai Tan, Ting Cao, Yunxin Liu

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: Large language model, real maintainers reject, achieve impressive results, Large language, agents achieve impressive

Comments: Preprint. Work in progress

Abstract:Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity: generated code ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached. We introduce Learning to Commit, a framework that closes this gap through Online Repository Memory. Given a repository with a strict chronological split, the agent performs supervised contrastive reflection on earlier commits: it blindly attempts to resolve each historical issue, compares its prediction against the oracle diff, and distils the gap into a continuously growing set of skills, that is, reusable patterns capturing coding style, internal API usage, and architectural invariants. When a new PR description arrives, the agent conditions its generation on these accumulated skills, producing changes grounded in the project's own evolution rather than generic pretraining priors. Evaluation is conducted on genuinely future, merged pull requests that could not have been seen during the skill-building phase, and spans multiple dimensions including functional correctness, code-style consistency, internal API reuse rate, and modified-region plausibility. Experiments on an expert-maintained repository with rich commit history show that Online Repository Memory effectively improves organicity scores on held-out future tasks.

2. 【2603.26663】Weight Tying Biases Token Embeddings Towards the Output Space

Link: https://arxiv.org/abs/2603.26663

Authors: Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta

Categories: Computation and Language (cs.CL)

Keywords: remains poorly understood, space remains poorly, language model design, learned embedding space, embedding space remains

Abstract:Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.
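
To make the term concrete, here is a minimal sketch of weight tying in plain Python. The `TinyTiedLM` class, its sizes, and the trivial forward pass (the hidden state is just the input embedding) are illustrative assumptions, not the models or the analysis from the paper. The only point is that with tying, the input lookup and the output projection use one and the same matrix.

```python
import random


def make_matrix(rows, cols, seed=0):
    """Small random matrix as a list of lists (hypothetical initialisation)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyTiedLM:
    """Toy model whose input embedding and output projection may share one matrix."""

    def __init__(self, vocab_size, dim, tied=True):
        self.embed = make_matrix(vocab_size, dim, seed=0)
        # With tying, the unembedding IS the embedding matrix (same object).
        # Without tying, it is a separate matrix of the same shape.
        self.unembed = self.embed if tied else make_matrix(vocab_size, dim, seed=1)

    def logits(self, token_id):
        # Assumption: no hidden layers, so the hidden state is the input embedding.
        h = self.embed[token_id]
        return [sum(hi * wi for hi, wi in zip(h, row)) for row in self.unembed]

    def num_parameters(self):
        n = len(self.embed) * len(self.embed[0])
        return n if self.unembed is self.embed else 2 * n
```

Because `unembed` is the same object as `embed` when `tied=True`, any update driven by the output loss also changes the input embeddings, which is the coupling the abstract examines. The halved parameter count is why tying matters most for small models.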

3. 【2603.26653】PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Link: https://arxiv.org/abs/2603.26653

Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: manually annotated benchmark, manually annotated, reasoning, video reasoning, perception-centric video reasoning

Comments: Project Page: [this https URL](https://perceptioncomp.github.io)

Abstract:We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

4. 【2603.26587】EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

Link: https://arxiv.org/abs/2603.26587

Authors: Paul Bontempo

Categories: Computation and Language (cs.CL)

Keywords: English-Tamil code-switched text, code-switched text, statistical modelling, language switch frequency, paper investigates

Comments: 5 pages, 2 figures

Abstract:This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.
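
The two per-utterance measurements named in the abstract could be computed from token-level language tags roughly as follows. This is a sketch under assumptions: the tag names `"en"` and `"ta"`, the choice to ignore any other tag, and the exact definitions of both measures are hypothetical and may differ from the paper's.

```python
def english_proportion(tags):
    """Share of English tokens among tokens tagged English or Tamil."""
    lang = [t for t in tags if t in ("en", "ta")]  # drop other tags (assumption)
    if not lang:
        return 0.0
    return sum(1 for t in lang if t == "en") / len(lang)


def switch_count(tags):
    """Number of adjacent token pairs whose language tag differs."""
    lang = [t for t in tags if t in ("en", "ta")]
    return sum(1 for a, b in zip(lang, lang[1:]) if a != b)


# Hypothetical tagged utterance: two Tamil tokens, two English, one Tamil, one symbol.
example = ["ta", "ta", "en", "en", "ta", "other"]
print(english_proportion(example), switch_count(example))
```

Dividing `switch_count` by the number of language-tagged tokens would give a length-normalised switch frequency, which is one way to control for utterance length.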

5. 【2603.26557】MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Link: https://arxiv.org/abs/2603.26557

Authors: Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, deliver strong performance, real-world services, users and sessions

Abstract:Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
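
The reuse-then-escalate flow described in the abstract can be sketched roughly as follows. Everything here is a hypothetical stand-in rather than MemBoost itself: the `MemoryRouter` class, the word-overlap similarity, both thresholds, and the assumption that the light model returns a confidence score alongside its answer.

```python
def jaccard(a, b):
    """Word-overlap similarity between two queries (a simple stand-in for retrieval)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


class MemoryRouter:
    def __init__(self, light_model, strong_model,
                 reuse_threshold=0.8, confidence_threshold=0.5):
        self.light_model = light_model    # callable: query -> (answer, confidence)
        self.strong_model = strong_model  # callable: query -> answer
        self.reuse_threshold = reuse_threshold
        self.confidence_threshold = confidence_threshold
        self.memory = []                  # stored (query, answer) pairs, grows over time
        self.strong_calls = 0             # number of expensive invocations

    def answer(self, query):
        # 1) Reuse a stored answer when a near-duplicate query is in memory.
        best = max(self.memory, key=lambda qa: jaccard(query, qa[0]), default=None)
        if best is not None and jaccard(query, best[0]) >= self.reuse_threshold:
            return best[1], "memory"
        # 2) Otherwise try the cheap model first.
        result, confidence = self.light_model(query)
        source = "light"
        # 3) Escalate to the strong model only when the cheap model is unsure.
        if confidence < self.confidence_threshold:
            result = self.strong_model(query)
            self.strong_calls += 1
            source = "strong"
        self.memory.append((query, result))
        return result, source
```

In this sketch a repeated hard query costs one strong-model call the first time and none afterwards, which is the kind of saving the abstract attributes to answer reuse.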

6. 【2603.26556】When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Link: https://arxiv.org/abs/2603.26556

Authors: Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Converting a pretrained, pretrained Transformer, reducing inference costs, inference costs, efficient hybrid model

Abstract:Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4× at 128K-token contexts.

7. 【2603.26544】Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

Link: https://arxiv.org/abs/2603.26544

Authors: Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa

Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)

Keywords: reliable reference datasets, optimal signal detection, identification of optimal, lack of reliable, time-indexed reference dataset

Comments: 4 Figures and 2 Tables

Abstract:Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.

8. 【2603.26539】How Open Must Language Models be to Enable Reliable Scientific Inference?

Link: https://arxiv.org/abs/2603.26539

Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: closed impact, reliable inference, Abstract, threaten reliable inference, inference

Abstract:How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

9. 【2603.26516】ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Link: https://arxiv.org/abs/2603.26516

Authors: Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, expand across multilingual, multilingual domains, increasingly important, Large Language

Comments: PROPOR 2026 - The 17th International Conference on Computational Processing of Portuguese

Abstract:As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.

10. 【2603.26515】JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Link: https://arxiv.org/abs/2603.26515

Authors: Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: industrial-grade Voice, Voice AI agent, remains a significant, significant challenge, challenge in industrial-grade

Comments: 8 pages, in progress

Abstract:Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.

11. 【2603.26511】AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Link: https://arxiv.org/abs/2603.26511

Authors: Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, open large language, European Portuguese, language models, remains underrepresented

Comments: PROPOR 2026 - The 17th International Conference on Computational Processing of Portuguese

Abstract:Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

12. 【2603.26510】Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

Link: https://arxiv.org/abs/2603.26510

Authors: Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira

Categories: Computation and Language (cs.CL)

Keywords: valuable unstructured information, Portuguese clinical NER, unstructured information, clinical NER, notes contain valuable

Comments: Under peer review. GitHub: [this https URL](https://github.com/GRUPOMED4U/clinical_ner_benchmark_paper)

Abstract:Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
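
For readers unfamiliar with the reported metric, micro-averaged F1 pools true positives, false positives, and false negatives over all documents and labels before computing precision and recall. The sketch below assumes entity-level matching on exact `(start, end, label)` tuples, which is a common convention but not necessarily the one used in the paper; the function name and the example spans are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged precision, recall, F1 over per-document sets of entity tuples."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted entities that exactly match a gold entity
        fp += len(p - g)   # predicted entities with no gold match
        fn += len(g - p)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical two-document example.
gold = [{(0, 1, "Drug"), (3, 4, "Disease")}, {(2, 3, "Drug")}]
pred = [{(0, 1, "Drug")}, {(2, 3, "Drug"), (5, 6, "Disease")}]
print(micro_f1(gold, pred))
```

Because counts are pooled, frequent labels dominate the micro average, which is one reason class imbalance matters for the comparison in the abstract.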

13. 【2603.26449】ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims

Link: https://arxiv.org/abs/2603.26449

Authors: Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm

Categories: Computation and Language (cs.CL)

Keywords: Automatically verifying climate-related, verifying climate-related claims, rhetorical strategies underlying, Automatically verifying, strategies underlying climate

Comments: Accepted at NSLP@LREC 2026

Abstract:Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, potentially implicating how future fact-checking systems should be designed.
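
The two standard metrics mentioned in the abstract can be sketched as below. Recall@K is unambiguous; for Binary Preference (bpref) the code follows one common formulation in which unjudged documents are ignored and each relevant document is penalised by the number of judged non-relevant documents ranked above it. The shared task's exact variant is not stated in the abstract, so treat the function names, the capping rule, and the example ranking as assumptions.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def bpref(ranked, relevant, nonrelevant):
    """Binary preference: uses only judged documents, ignores unjudged ones."""
    num_rel, num_nonrel = len(relevant), len(nonrelevant)
    if num_rel == 0:
        return 0.0
    denom = min(num_rel, num_nonrel)
    score = 0.0
    nonrel_seen = 0
    for doc in ranked:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if nonrel_seen == 0 or denom == 0:
                score += 1.0
            else:
                # Penalise by judged non-relevant documents ranked above this one.
                score += 1.0 - min(nonrel_seen, num_rel) / denom
    return score / num_rel


# Hypothetical ranking: d* are judged relevant, x* judged non-relevant, u1 unjudged.
ranked = ["d1", "x1", "d2", "u1", "x2", "d3"]
print(recall_at_k(ranked, {"d1", "d2", "d3"}, 3))
print(bpref(ranked, {"d1", "d2", "d3"}, {"x1", "x2"}))
```

Note that the unjudged document `u1` affects neither score here, which illustrates why bpref is often preferred when annotations are incomplete.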

14. 【2603.26434】Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Link: https://arxiv.org/abs/2603.26434

Authors: Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen

Categories: Computation and Language (cs.CL)

Keywords: electronic health records, Contextual Question Answering, time-consuming and error-prone, electronic health, Clinicians

Abstract:Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.

15. 【2603.26430】Analysing Calls to Order in German Parliamentary Debates

Link: https://arxiv.org/abs/2603.26430

Authors: Nina Smirnova, Daniel Dan, Philipp Mayr

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: shaping legislative outcomes, shaping legislative, public discourse, constitutes a central, legislative outcomes

Comments: The paper is accepted to the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) co-located with LREC 2026

Abstract:Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule-based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. An insult towards individuals is the most frequent cause of CtO. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition-party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: this https URL.

16. 【2603.26410】Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

Link: https://arxiv.org/abs/2603.26410

Authors: Richard J. Young

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Extended-thinking models expose, Extended-thinking models, alongside the user-visible, thinking tokens, MMLU and GPQA

Comments: 19 pages, 8 figures, 4 tables

Abstract:Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model's thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed *thinking-answer divergence*. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most *transparent* hint, with 58.8% of sycophancy-influenced cases acknowledging the professor's authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.
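
The four-way classification described in the abstract (thinking, answer, both, or neither) can be illustrated with a simple keyword check. This is only a sketch: the function names, the case-insensitive substring matching, and the example keyword are assumptions, and the study's actual keyword lists and matching rules are not given in the abstract.

```python
from collections import Counter


def acknowledgment_channel(thinking, answer, keywords):
    """Label where hint-related keywords appear: thinking, answer, both, or neither."""
    def mentions(text):
        lowered = text.lower()
        return any(k.lower() in lowered for k in keywords)

    in_thinking, in_answer = mentions(thinking), mentions(answer)
    if in_thinking and in_answer:
        return "both"
    if in_thinking:
        return "thinking_only"   # the "thinking-answer divergence" pattern
    if in_answer:
        return "answer_only"
    return "neither"


def channel_shares(cases, keywords):
    """Share of each label over a list of (thinking, answer) pairs."""
    counts = Counter(acknowledgment_channel(t, a, keywords) for t, a in cases)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

Applied to hint-following cases only, the `thinking_only` share corresponds to the divergence rate the abstract reports, and the `neither` share to cases that stay invisible even with thinking-token access.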

17. 【2603.26401】Word Alignment-Based Evaluation of Uniform Meaning Representations

Link: https://arxiv.org/abs/2603.26401

Authors: Daniel Zeman, Federica Gamba

Categories: Computation and Language (cs.CL)

Keywords: Uniform Meaning Representations, challenge because competing, Meaning Representations, graph-based representations, competing representations

Abstract:Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have a different number of nodes, and it is not obvious which nodes should be compared to each other. Existing approaches favor node mapping that maximizes $F_1$ score over node relations and attributes, regardless of whether the similarity is intentional or accidental; consequently, the identified mismatches in values of node attributes are not useful for any detailed error analysis. We propose a node-matching algorithm that allows comparison of multiple Uniform Meaning Representations (UMR) of one sentence and that takes advantage of node-word alignments, inherently available in UMR. We compare it with previously used approaches, in particular smatch (the de-facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable, while avoiding the NP-hard search problem inherent in smatch. A script implementing the method is freely available.

18. 【2603.26380】Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Link: https://arxiv.org/abs/2603.26380

Authors: Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang

Categories: Computation and Language (cs.CL)

Keywords: Sliding window attention, core component, component in modern, attention, full attention

Abstract:The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
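
The difference between the two branches comes down to which earlier positions a token may attend to. The sketch below builds a causal attention mask in which each token uses either the full prefix or a fixed-size sliding window. It is an illustration only: in the paper the routing decision is made by the model per token and per layer, whereas here `use_full` is simply passed in, and the function name and window size are hypothetical.

```python
def routed_attention_mask(seq_len, window, use_full):
    """Boolean mask[i][j]: may token i attend to token j?

    use_full[i] is True when token i is routed to full causal attention,
    False when it is routed to sliding-window attention of size `window`.
    """
    mask = []
    for i in range(seq_len):
        # Full branch sees the whole prefix; window branch sees the last `window` tokens.
        start = 0 if use_full[i] else max(0, i - window + 1)
        mask.append([start <= j <= i for j in range(seq_len)])
    return mask


# Hypothetical routing for a 5-token sequence with a window of 2.
mask = routed_attention_mask(5, 2, [False, False, True, False, True])
attended = sum(sum(row) for row in mask)
print(attended)
```

With every token on the full branch the mask has 15 allowed positions for this length; the mixed routing above allows fewer, which is the compute saving that dynamic routing trades against a narrower receptive field.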

19. 【2603.26363】A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models

Link: https://arxiv.org/abs/2603.26363

Authors: Steffen Herbold, Florian Lemmerich

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, texts using Large, inherently uncertain

Comments:

Abstract:The generation of texts using Large Language Models (LLMs) is inherently uncertain, with sources of uncertainty being not only the generation of texts, but also the prompt used and the downstream interpretation. Within this work, we provide a formal framework for the measurement of uncertainty that takes these different aspects into account. Our framework models prompting, generation, and interpretation as interconnected autoregressive processes that can be combined into a single sampling tree. We introduce filters and objective functions to describe how different aspects of uncertainty can be expressed over the sampling tree and demonstrate how to express existing approaches towards uncertainty through these functions. With our framework we show not only how different methods are formally related and can be reduced to a common core, but also point out additional aspects of uncertainty that have not yet been studied.

20. 【2603.26332】CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

Link: https://arxiv.org/abs/2603.26332

Authors: JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Legal, Legal reasoning requires, evaluate rule application, Legal reasoning, rules operate

Comments: 15 pages

Abstract:Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at this https URL.

21. 【2603.26323】From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

Link: https://arxiv.org/abs/2603.26323

Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models', increasingly important capability, reflects structured internal, benchmarks reflects structured, foundation models

Comments:

Abstract:As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives, relational composition, representational transformation, and stateful spatial updating, and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single pass inference, and analyze internal representations using linear probing, sparse autoencoder based feature analysis, and causal interventions. We find that task relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context dependent spatial representations rather than robust, general purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

22. 【2603.26292】findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Link: https://arxiv.org/abs/2603.26292

Authors: Héctor Javier Vázquez Martínez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: unsupervised word discovery, units offer compact, syllabification remains fragmented, Syllable-level units offer, linguistically meaningful representations

Comments: 4 pages + 2 for references, disclosures acknowledgements; currently under review

Abstract:Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

23. 【2603.26259】Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

Link: https://arxiv.org/abs/2603.26259

Authors: Antoine Edy, Max Conti, Quentin Macé

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: potentially hiding performance, hiding performance bottlenecks, Late Interaction models, dynamics remain understudied, underlying dynamics remain

Comments: Accepted at The 1st Late Interaction Workshop (LIR) @ ECIR 2026

Abstract:While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.
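
For readers unfamiliar with the MaxSim operator mentioned in this abstract, the following is a minimal sketch of multi-vector late-interaction scoring: each query token keeps only its best-matching document token, and the per-token maxima are summed. The function names and toy vectors are assumptions for illustration; this is not the code or the models analyzed in the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the max similarity to any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical 2-dimensional token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_short = [[0.6, 0.8]]
doc_long = doc_short + [[1.0, 0.0]]  # same tokens plus one extra

short_score = maxsim_score(query, doc_short)
long_score = maxsim_score(query, doc_long)
```

Because each maximum is taken over a set of document tokens, appending tokens without changing the existing embeddings can never lower the score. This is one plausible reading of the length bias the abstract describes for causal models; the paper itself should be consulted for the exact argument.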

24. 【2603.26253】SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia

Link: https://arxiv.org/abs/2603.26253

Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Categories: Computation and Language (cs.CL)

Keywords: Indonesia is constrained, Big data research, review sites, fundamental fragmentation, social media

Comments: 10 pages, 1 Figure, 4 Tables

Abstract:Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform's utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at this https URL.

25. 【2603.26248】Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

Link: https://arxiv.org/abs/2603.26248

Authors: Chihiro Taguchi, Yukinori Takubo, David Chiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: linguistic diversity worldwide, Language endangerment poses, diversity worldwide, endangerment poses, poses a major

Comments: 9 pages, 4 tables, 4 figures, accepted at LREC 2026

Abstract:Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a {\totaldatasethours}-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.

26. 【2603.26246】Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Link: https://arxiv.org/abs/2603.26246

Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Keywords: Standard LLM-based speech, systems typically process, typically process utterances, speech recognition systems, recognition systems typically

Comments: 11 pages

Abstract:Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.

27. 【2603.26236】A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs

Link: https://arxiv.org/abs/2603.26236

Authors: Uri Z. Kialy, Avi Shtarkberg, Ayal Klein

Categories: Computation and Language (cs.CL)

Keywords: abstract concepts, process culture-specific pragmatic, isolated language-specific memorizations, successfully transfer factual, factual and syntactic

Comments:

Abstract:While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent "informal register subspace" that sharpens in the model's deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.

28. 【2603.26235】GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation

Link: https://arxiv.org/abs/2603.26235

Authors: Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley, William Whiteley

Categories: Computation and Language (cs.CL)

Keywords: Generation Scotland cohort, brain radiology reports, Generation Scotland, brain disease phenotypes, Scotland cohort

Comments: 11 pages, 1 figure

Abstract:We present GS-BrainText, a curated dataset of 8,511 brain radiology reports from the Generation Scotland cohort, of which 2,431 are annotated for 24 brain disease phenotypes. This multi-site dataset spans five Scottish NHS health boards and includes broad age representation (mean age 58, median age 53), making it uniquely valuable for developing and evaluating generalisable clinical natural language processing (NLP) algorithms and tools. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema, with 10-100% double annotation per NHS health board and rigorous quality assurance. Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), highlighting critical challenges in generalisation of NLP tools. The GS-BrainText dataset addresses a significant gap in available UK clinical text resources and provides a valuable resource for the study of linguistic variation, diagnostic uncertainty expression and the impact of data characteristics on NLP system performance.

29. 【2603.26233】Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Link: https://arxiv.org/abs/2603.26233

Authors: Nicholas Edwards, Sebastian Schuster

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, lack crucial context, frequently encounter underspecified, software engineering

Comments:

Abstract:As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

30. 【2603.26207】Sparse Auto-Encoders and Holism about Large Language Models

Link: https://arxiv.org/abs/2603.26207

Authors: Jumbly Grindrod

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, Language Model, Large Language, words and complex, Grindrod

Comments:

Abstract:Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

31. 【2603.26182】ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Link: https://arxiv.org/abs/2603.26182

Authors: Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, non-linear reasoning required, accurate clinical diagnosis

Comments: 16 pages, 1 figure, 6 tables, conference

Abstract:While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

32. 【2603.26164】DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Link: https://arxiv.org/abs/2603.26164

Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language models, improving large language, language models, model parameters, promising direction

Comments:

Abstract:Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.

33. 【2603.26156】Clash of the models: Comparing performance of BERT-based variants for generic news frame detection

Link: https://arxiv.org/abs/2603.26156

Authors: Vihang Jumle

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: extensively applied theories, extensively applied, applied theories, frame detection, continues to remain

Comments:

Abstract:Framing continues to remain one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it comparatively performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.

34. 【2603.26127】Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Link: https://arxiv.org/abs/2603.26127

Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Self-supervised Vision Transformers, Self-supervised Vision, Vision Transformers, DINO show, typically observed

Comments: Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

35. 【2603.26106】LLM Benchmark-User Need Misalignment for Climate Change

Link: https://arxiv.org/abs/2603.26106

Authors: Oucheng Liu, Lexing Xie, Jing Jiang

Categories: Computation and Language (cs.CL)

Keywords: major socio-scientific issue, socio-scientific issue shapes, issue shapes public, shapes public decision-making, policy discussions

Comments: 37 pages (8 main), 31 figures, 14 tables

Abstract:Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge-seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at this https URL.

36. 【2603.26095】IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text

Link: https://arxiv.org/abs/2603.26095

Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja

Categories: Computation and Language (cs.CL)

Keywords: natural language processing, Bahasa Indonesia, remains largely unexplored, unexplored for Bahasa, language processing

Comments: 9 pages, 3 figures, 6 tables

Abstract:Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.

37. 【2603.26089】Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind

Link: https://arxiv.org/abs/2603.26089

Authors: Christopher Ackerman

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: agents with knowledge, guide their behavior, social world, represent oneself, universal that enables

Comments: 22 pages, 13 figures, 1 table

Abstract:The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.

38. 【2603.26076】Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management

Link: https://arxiv.org/abs/2603.26076

Authors: Darryl Teo,Adharsha Sam,Chuan Shen Marcus Koh,Rakesh Nagi,Nuno Antunes Ribeiro

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: extensive technical terminology, Total Airport Management, proprietary regional information, inherently complex due, rigorous regulations

Notes:

Abstract:Documentation of airport operations is inherently complex due to extensive technical terminology, rigorous regulations, proprietary regional information, and fragmented communication across multiple stakeholders. The resulting data silos and semantic inconsistencies present a significant impediment to the Total Airport Management (TAM) initiative. This paper presents a methodological framework for constructing a domain-grounded, machine-readable Knowledge Graph (KG) through a dual-stage fusion of symbolic Knowledge Engineering (KE) and generative Large Language Models (LLMs). The framework employs a scaffolded fusion strategy in which expert-curated KE structures guide LLM prompts to facilitate the discovery of semantically aligned knowledge triples. We evaluate this methodology on the Google LangExtract library and investigate the impact of context window utilization by comparing localized segment-based inference with document-level processing. Contrary to prior empirical observations of long-context degradation in LLMs, document-level processing improves the recovery of non-linear procedural dependencies. To ensure the high-fidelity provenance required in airport operations, the proposed framework fuses a probabilistic model for discovery and a deterministic algorithm for anchoring every extraction to its ground source. This ensures absolute traceability and verifiability, bridging the gap between "black-box" generative outputs and the transparency required for operational tooling. Finally, we introduce an automated framework that operationalizes this pipeline to synthesize complex operational workflows from unstructured textual corpora.

39. 【2603.26062】I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories

Link: https://arxiv.org/abs/2603.26062

Authors: Manisha Keim,Sarmad Chandio,Osama Khalid,Rishab Nithyanand

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Keywords: belief formation, largely focused, focused on belief, paying less attention, conspiracy theories

Notes:

Abstract:Research on conspiracy theories has largely focused on belief formation, exposure, and diffusion, while paying less attention to how their meanings change over time. This gap persists partly because conspiracy-related terms are often treated as stable lexical markers, making it difficult to separate genuine semantic changes from surface-level vocabulary changes. In this paper, we measure the semantic structure and evolution of conspiracy theories in online political discourse. Using 169.9M comments from Reddit's r/politics subreddit spanning 2012--2022, we first demonstrate that conspiracy-related language forms coherent and semantically distinguishable regions of language space, allowing conspiracy theories to be treated as semantic objects. We then track how these objects evolve over time using aligned word embeddings, enabling comparisons of semantic neighborhoods across periods. Our analysis reveals that conspiracy theories evolve non-uniformly, exhibiting patterns of semantic stability, expansion, contraction, and replacement that are not captured by keyword-based approaches alone.

40. 【2603.26046】Retrieval-Augmented Generation Based Nurse Observation Extraction

Link: https://arxiv.org/abs/2603.26046

Authors: Kyomin Hwang,Nojun Kwak

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, reducing human workload, Recent advancements

Notes:

Abstract:Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.

41. 【2603.26045】H-Node Attack and Defense in Large Language Models

Link: https://arxiv.org/abs/2603.26045

Authors: Eric Yocam,Varghese Vaidyan,Yong Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: large language models, transformer-based large language, defends hallucination representations, H-Node Adversarial Noise, Adversarial Noise Cancellation

Notes: 17 pages, 7 figures, 6 tables

Abstract:We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions -- termed Hallucination Nodes (H-Nodes) -- with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers up to 0.69 robustness from a single-pass baseline of 8%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). Perplexity impact is surgical (5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.

42. 【2603.26034】AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Link: https://arxiv.org/abs/2603.26034

Authors: Wenbo Gao,Renxi Liu,Xian Wang,Fang Guo,Shuai Yang,Xi Chen,Hui-Ling Zhen,Hanting Chen,Weizhe Lin,Xiaosong Li,Yaoyuan Wang

Categories: Computation and Language (cs.CL)

Keywords: Autonomous agents powered, perform complex tasks, fundamental trade-off arises, large language models, Autonomous agents

Notes:

Abstract:Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
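The escalation idea in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `run_with_escalation`, the `small_ok` callback standing in for the agent's self-reflection signal, the sliding `window` of recent failures, and the rule that the stronger tier receives `base_budget` steps per recent failure.

```python
def run_with_escalation(num_steps, small_ok, base_budget=1, window=3):
    """Return which tier ("small" or "large") handled each step.

    small_ok(step) is a hypothetical stand-in for the agent's self-reflection
    signal: True if the small model judges that it is making progress.
    """
    tiers = []
    recent_failures = []   # sliding window of recent self-check failures
    large_budget = 0       # steps still owed to the stronger tier
    for step in range(num_steps):
        if large_budget > 0:
            # The stronger tier is in control until its budget runs out.
            tiers.append("large")
            large_budget -= 1
            continue
        tiers.append("small")
        failed = not small_ok(step)
        recent_failures = (recent_failures + [failed])[-window:]
        if failed:
            # Assumed cumulative rule: more recent failures, larger budget.
            large_budget = base_budget * sum(recent_failures)
    return tiers


# Toy run: the small model reports no progress at steps 2 and 4.
tiers = run_with_escalation(8, small_ok=lambda step: step not in {2, 4})
print(tiers)
```

In this toy run the second failure arrives while the first is still inside the window, so the stronger tier is given two steps instead of one. That is one simple way to read "difficulty-aware cumulative escalation"; the paper's actual rule may differ.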

43. 【2603.26013】Toward Culturally Grounded Natural Language Processing

Link: https://arxiv.org/abs/2603.26013

Authors: Sina Bagheri Nezhad

Categories: Computation and Language (cs.CL)

Keywords: broader global inclusivity, global inclusivity, growing literature shows, evidence of broader, broader global

Notes:

Abstract:Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020--2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

44. 【2603.25981】Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

Link: https://arxiv.org/abs/2603.25981

Authors: Amirhosein Chahe,Lifeng Zhou

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: world model planning, world model, natural language instructions, language instructions remains, remains a fundamental

Notes:

Abstract:Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.

45. 【2603.25975】Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

Link: https://arxiv.org/abs/2603.25975

Authors: Peter Balogh

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Schank, ATRANS, PTRANS, Minimum Description Length, CHANGE

Notes:

Abstract:We show that they do. Schank's conceptual dependency theory proposed that all events decompose into primitive operations -- ATRANS, PTRANS, MTRANS, and others -- hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank's: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators ("mail" = ATRANS + PTRANS) and novel emotional state operators absent from Schank's taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank's hand-coded primitives while explaining 100% of events vs. Schank's 81%. On ATOMIC, results are more dramatic: Schank's primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank's core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed -- with mental/emotional operators dominating in naturalistic data.

Submission history: [v1] Thu, 26 Mar 2026 23:35:39 UTC (82 KB), from Peter Balogh

46. 【2603.25973】MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Link: https://arxiv.org/abs/2603.25973

Authors: Weizhi Zhang,Xiaokai Wei,Wei-Chieh Huang,Zheng Hui,Chen Wang,Michelle Gong,Philip S. Yu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, expanded context windows, Recent advancements, short-session synthetic dialogues

Notes: Published as a workshop paper in Lifelong Agent @ ICLR 2026

Abstract:Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent's ability to simulate real user behaviors in both single and cross-domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross-domain life-long personalization evaluation.

47. 【2603.25960】When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

Link: https://arxiv.org/abs/2603.25960

Authors: Binesh Sadanandan,Vahid Behzadan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, remains poorly characterized, formatting remains poorly, prompt formatting remains

Notes:

Abstract:Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models "know" more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.
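The permutation voting mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the abstract does not say which orderings are used, so cyclic rotations are assumed here, and `permutation_vote`, `predict_position`, and the position-biased `biased_model` are hypothetical names and toy behavior.

```python
from collections import Counter


def permutation_vote(options, predict_position):
    """Majority vote over cyclic rotations of the answer options.

    predict_position(shown) is a hypothetical callable returning the 0-based
    position the model picks in the list it was shown.
    """
    n = len(options)
    votes = Counter()
    for shift in range(n):
        order = [(j + shift) % n for j in range(n)]   # original indices, rotated
        shown = [options[j] for j in order]
        picked = predict_position(shown)
        votes[order[picked]] += 1                     # map back to original index
    return votes.most_common(1)[0][0]


def biased_model(shown):
    """Toy stand-in: picks 'aspirin' whenever it is listed first (position
    bias), otherwise picks 'warfarin' wherever it appears."""
    if shown[0] == "aspirin":
        return 0
    return shown.index("warfarin")


options = ["aspirin", "ibuprofen", "warfarin", "heparin"]
single = biased_model(options)                   # original ordering only
voted = permutation_vote(options, biased_model)  # vote across rotations
print(options[single], options[voted])
```

With the original ordering alone the toy model falls for the first option. Voting over rotations puts each option in each position exactly once, so a purely positional preference is outvoted.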

48. 【2603.25944】Can Small Models Reason About Legal Documents? A Comparative Study

Link: https://arxiv.org/abs/2603.25944

Authors: Snehit Vaddi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, models show promise, frontier models raises, models raises concerns, deploying frontier models

Notes: 17 pages, 9 models, 5 prompting strategies, 3 legal benchmarks, 405 experiments

Abstract:Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model's utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.

49. 【2603.25926】Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio

Link: https://arxiv.org/abs/2603.25926

Authors: Yijiong Yu,Shuai Yuan,Jie Zheng,Huazheng Wang,Ji Pei

Categories: Computation and Language (cs.CL)

Keywords: Soft context compression, processing long contexts, encoding long context, Soft context, processing long

Notes:

Abstract:Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density-aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input-dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce the Semi-Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is efficiently jointly trained with the compressor on synthetic data, with the summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density-aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at this https URL
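The two mechanical pieces named in this abstract, snapping a predicted ratio to a discrete set and mean pooling, can be sketched as below. This is an illustration under assumptions: the ratio set `(2, 4, 8, 16)`, the nearest-value rounding rule, and the helper names `quantize_ratio` and `mean_pool` are hypothetical and not taken from the paper. The learned selector and compressor are not modeled.

```python
def quantize_ratio(predicted_ratio, allowed_ratios=(2, 4, 8, 16)):
    """Snap a continuous predicted ratio to the closest allowed ratio.

    The allowed set is an assumed example, not the paper's configuration.
    """
    return min(allowed_ratios, key=lambda r: abs(r - predicted_ratio))


def mean_pool(vectors, ratio):
    """Average every `ratio` consecutive vectors into one latent vector."""
    pooled = []
    for start in range(0, len(vectors), ratio):
        chunk = vectors[start:start + ratio]   # last chunk may be shorter
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy usage: a selector output of 2.3 is snapped to ratio 2, then four
# token vectors are pooled into two latent vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
ratio = quantize_ratio(2.3)
latents = mean_pool(tokens, ratio)
print(ratio, latents)
```

Restricting the ratio to a few discrete values means the compressor only ever sees a handful of pooling widths, which matches the abstract's motivation for avoiding a continuous, input-dependent ratio.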

50. 【2603.25862】Methods for Knowledge Graph Construction from Text Collections: Development and Applications

Link: https://arxiv.org/abs/2603.25862

Authors: Vanni Zavarella

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: open access scholarly, access scholarly communications, media online interactions, unstructured textual data, Virtually every sector

Notes:

Abstract:Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

51. 【2603.25836】Gradient-Informed Training for Low-Resource Multilingual Speech Translation

Link: https://arxiv.org/abs/2603.25836

Authors: Ruiyan Sun,Satoshi Nakamura

Categories: Computation and Language (cs.CL)

Keywords: frequently introduces representation, introduces representation conflicts, uniform architectural sharing, low-resource multilingual, uniform architectural

Notes:

Abstract:In low-resource multilingual speech-to-text translation, uniform architectural sharing across languages frequently introduces representation conflicts that impede convergence. This work proposes a principled methodology to automatically determine layer-specific sharing patterns by mining training gradient information. Our approach employs three distinct analysis strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization coupled with canonical correlation analysis for subspace alignment. Extensive evaluation across four language pairs (using the SeamlessM4T-Medium architecture) demonstrates persistent improvements in translation quality metrics.

52. 【2603.25821】Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

Link: https://arxiv.org/abs/2603.25821

Authors: Anna Kozlova,Stanislau Salavei,Pavel Satalkin,Hanna Plotnitskaya,Sergey Parfenyuk

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: present Doctorina MedBench, Doctorina MedBench, present Doctorina, realistic physician-patient interactions, physician-patient interactions

Notes:

Abstract:We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

53. 【2603.25804】RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Link: https://arxiv.org/abs/2603.25804

Authors: Jiajun Zhang,Yuying Li,Zhixun Li,Xingyu Guo,Jingzhuo Wu,Leqi Zheng,Yiran Yang,Jianke Zhang,Qingbin Li,Shannan Yan,Zhetong Li,Changguo Jia,Junfei Wu,Zilei Wang,Qiang Liu,Liang Wang

Categories: Computation and Language (cs.CL)

Keywords: demonstrated impressive capabilities, demonstrated impressive, impressive capabilities, Vision-Language Models, Abstract

Notes:

Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at this https URL.

54. 【2603.25752】Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

Link: https://arxiv.org/abs/2603.25752

Authors: Ying Liu,Yuntao Shou,Wei Ai,Tao Meng,Keqin Li

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: limited acquisition conditions, real-world scenarios, acquisition conditions, resulting in extracted, subject to environmental

Notes: 19 pages

Abstract:In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.

55. 【2603.26494】Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models

Link: https://arxiv.org/abs/2603.26494

Authors: Nathan Roll

Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL)

Keywords: shown competitive performance, circuits exploit genuinely, genuinely quantum resources, trained quantum circuits, quantum circuits exploit

Notes: 9 pages, 5 figures, 7 tables

Abstract:Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources -- or merely embed classical computation in quantum hardware -- remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement tracking, and density-matrix interchange interventions on a controlled long-range dependency task. We find that single-qubit models are exactly classically simulable and converge to the same geometric strategy as matched classical baselines, while two-qubit models with entangling gates learn a representationally distinct strategy that encodes context in inter-qubit entanglement -- confirmed by three independent causal tests (p < 0.0001, d = 0.89). On real quantum hardware, only the classical geometric strategy survives device noise; the entanglement strategy degrades to chance. These findings open mechanistic interpretability as a tool for the science of quantum language models and reveal a noise-expressivity tradeoff governing which learned strategies survive deployment.

Information Retrieval

1. 【2603.26430】Analysing Calls to Order in German Parliamentary Debates

Link: https://arxiv.org/abs/2603.26430

Authors: Nina Smirnova,Daniel Dan,Philipp Mayr

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: shaping legislative outcomes, shaping legislative, public discourse, constitutes a central, legislative outcomes

Comments: The paper is accepted to the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) co-located with LREC 2026

Click to view abstract

2. 【2603.26426】Demystifying Funding: Reconstructing a Unified Dataset of the UK Funding Lifecycle

Link: https://arxiv.org/abs/2603.26426

Authors: William Thorne,Rupert Shepherd,Diana Maynard

Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)

Keywords: UKRI Gateway, UKRI research councils, UKRI funding opportunities, panel meeting outcomes, links funding opportunities

Comments: Accepted at NSLP 2026

Click to view abstract

Abstract:We present a reconstruction of UKRI's Gateway to Research (GtR) database that links funding opportunities to their resulting project proposals through panel meeting outcomes. Unlike existing work that focuses primarily on funded projects and their outcomes, we close the complete funding lifecycle by integrating three previously disconnected data sources: the GtR project database, UKRI funding opportunities, and competitive funding decision records across UKRI's research councils. We describe the technical challenges of data collection, including navigating inconsistent publication formats and restricted access to panel decisions. The resulting dataset enables a holistic interrogation of the entire funding process, from opportunity announcement to research outcomes. We release the database and associated code.

3. 【2603.26259】Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

Link: https://arxiv.org/abs/2603.26259

Authors: Antoine Edy,Max Conti,Quentin Macé

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: potentially hiding performance, hiding performance bottlenecks, Late Interaction models, dynamics remain understudied, underlying dynamics remain

Comments: Accepted at The 1st Late Interaction Workshop (LIR) @ ECIR 2026

Click to view abstract

4. 【2603.26100】Rethinking Recommendation Paradigms: From Pipelines to Agentic Recommender Systems

Link: https://arxiv.org/abs/2603.26100

Authors: Jinxin Hu,Hao Deng,Lingyu Mu,Hao Zhang,Shizhun Wang,Yu Zhang,Xiaoyi Zeng

Categories: Information Retrieval (cs.IR)

Keywords: Large-scale industrial recommenders, industrial recommenders typically, Large-scale industrial, large pre-trained models, fixed multi-stage pipeline

Comments:

Click to view abstract

Abstract:Large-scale industrial recommenders typically use a fixed multi-stage pipeline (recall, ranking, re-ranking) and have progressed from collaborative filtering to deep and large pre-trained models. However, both multi-stage and so-called One Model designs remain essentially static: models are black boxes, and system improvement relies on manual hypotheses and engineering, which is hard to scale under heterogeneous data and multi-objective business constraints. We propose an Agentic Recommender System (AgenticRS) that reorganizes key modules as agents. Modules are promoted to agents only when they form a functionally closed loop, can be independently evaluated, and possess an evolvable decision space. For model agents, we outline two self-evolution mechanisms: reinforcement learning style optimization in well-defined action spaces, and large language model based generation and selection of new architectures and training schemes in open-ended design spaces. We further distinguish individual evolution of single agents from compositional evolution over how multiple agents are selected and connected, and use a layered inner and outer reward design to couple local optimization with global objectives. This provides a concise blueprint for turning static pipelines into self-evolving agentic recommender systems.

5. 【2603.26085】AgenticRS-Architecture: System Design for Agentic Recommender Systems

Link: https://arxiv.org/abs/2603.26085

Authors: Hao Zhang,Jinxin Hu,Hao Deng,Lingyu Mu,Shizhun Wang,Yu Zhang,Xiaoyi Zeng

Categories: Information Retrieval (cs.IR)

Keywords: agent based architecture, based architecture, full lifecycle, lifecycle of industrial, industrial recommender systems

Comments:

Click to view abstract

Abstract:AutoModel is an agent based architecture for the full lifecycle of industrial recommender systems. Instead of a fixed recall and ranking pipeline, AutoModel organizes recommendation as a set of interacting evolution agents with long term memory and self improvement capability. We instantiate three core agents along the axes of models, features, and resources: AutoTrain for model design and training, AutoFeature for data analysis and feature evolution, and AutoPerf for performance, deployment, and online experimentation. A shared coordination and knowledge layer connects these agents and records decisions, configurations, and outcomes. Through a case study of a module called paper autotrain, we show how AutoTrain automates paper driven model reproduction by closing the loop from method parsing to code generation, large scale training, and offline comparison, reducing manual effort for method transfer. AutoModel enables locally automated yet globally aligned evolution of large scale recommender systems and can be generalized to other AI systems such as search and advertising.

6. 【2603.26076】Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management

Link: https://arxiv.org/abs/2603.26076

Authors: Darryl Teo,Adharsha Sam,Chuan Shen Marcus Koh,Rakesh Nagi,Nuno Antunes Ribeiro

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: extensive technical terminology, Total Airport Management, proprietary regional information, inherently complex due, rigorous regulations

Comments:

Click to view abstract

Computer Vision

1. 【2603.26665】Detailed Geometry and Appearance from Opportunistic Motion

Link: https://arxiv.org/abs/2603.26665

Authors: Ryosuke Hirai,Kohei Yamashita,Antoine Guédon,Ryo Kawahara,Vincent Lepetit,Ko Nishino

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains fundamentally constrained, broad applications, set of fixed, foundational task, task with broad

Comments:

Click to view abstract

Abstract:Reconstructing 3D geometry and appearance from a sparse set of fixed cameras is a foundational task with broad applications, yet it remains fundamentally constrained by the limited viewpoints. We show that this bound can be broken by exploiting opportunistic object motion: as a person manipulates an object (e.g., moving a chair or lifting a mug), the static cameras effectively "orbit" the object in its local coordinate frame, providing additional virtual viewpoints. Harnessing this object motion, however, poses two challenges: the tight coupling of object pose and geometry estimation and the complex appearance variations of a moving object under static illumination. We address these by formulating a joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters, and by introducing a novel appearance model that factorizes diffuse and specular components with reflected directional probing within the spherical harmonics space. Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints demonstrate that our method recovers significantly more accurate geometry and appearance than state-of-the-art baselines.

2. 【2603.26661】GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Link: https://arxiv.org/abs/2603.26661

Authors: Nicolas von Lützow,Barbara Rössle,Katharina Schmid,Matthias Nießner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generative modeling rely, recent advances, rely on diffusion, diffusion or flow-matching, generative modeling

Comments: Project page: [this https URL](https://nicolasvonluetzow.github.io/GaussianGPT/) - Project video: [this https URL](https://youtu.be/zVnMHkFzHDg)

Click to view abstract

Abstract:Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.

3. 【2603.26658】Zero-Shot Depth from Defocus

Link: https://arxiv.org/abs/2603.26658

Authors: Yiming Zuo,Hongyu Wen,Venkat Subramanian,Patrick Chen,Karhan Kayan,Mario Bijelic,Felix Heide,Jia Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense metric depth, metric depth map, estimating a dense, dense metric, metric depth

Comments:

Click to view abstract

Abstract:Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at this https URL. The code and checkpoints are released at this https URL.

4. 【2603.26657】Tunable Soft Equivariance with Guarantees

Link: https://arxiv.org/abs/2603.26657

Authors: Md Ashiqur Rahman,Lim Jun Hao,Jeremiah Jiang,Teck-Yian Lim,Raymond A. Yeh

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: computer vision models, real-world data, fundamental property, property in computer, computer vision

Comments:

Click to view abstract

Abstract:Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

5. 【2603.26653】PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Link: https://arxiv.org/abs/2603.26653

Authors: Shaoxuan Li,Zhixuan Zhao,Hanze Deng,Zirun Ma,Shulin Tian,Zuyan Liu,Yushi Hu,Haoning Wu,Yuhao Dong,Benlin Liu,Ziwei Liu,Ranjay Krishna

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: manually annotated benchmark, manually annotated, reasoning, video reasoning, perception-centric video reasoning

Comments: Project Page: [this https URL](https://perceptioncomp.github.io)

Click to view abstract

6. 【2603.26646】Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

Link: https://arxiv.org/abs/2603.26646

Authors: Ling Li,Bowen Liu,Zinuo Zhan,Peng Jie,Jianhui Zhong,Kenglun Chang,Zhidong Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional Visual Grounding, ignores non-verbal deictic, Traditional Visual, predominantly relies, localize objects

Comments:

Click to view abstract

Abstract:Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.

7. 【2603.26639】Make Geometry Matter for Spatial Reasoning

Link: https://arxiv.org/abs/2603.26639

Authors: Shihua Zhang,Qiuhong Shen,Shizun Wang,Tianbo Pan,Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: videos remains limited, achieve strong image, Empowered by large-scale, dynamic videos remains, video understanding

Comments:

Click to view abstract

Abstract:Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at this https URL.

8. 【2603.26638】Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting

Link: https://arxiv.org/abs/2603.26638

Authors: Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Philip Schneider,Chunming Qiao,Alina Vereshchaka

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: severe technical challenges, exteriors improves buyer, drive-throughs presents severe, presents severe technical, online automotive marketplaces

Comments: 8 pages, 7 figures, Submitted to IEEE IROS 2026 (under review)

Click to view abstract

Abstract:High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.
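For readers unfamiliar with the PSNR figures quoted above, the metric is 10 * log10(MAX^2 / MSE). The sketch below uses the standard definition on flat pixel sequences. The function names are chosen for this example, and nothing here reproduces the paper's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Read this way, the reported 3.85 dB gain implies a baseline of about 28.66 - 3.85 = 24.81 dB, and `mse_ratio_for_gain(3.85)` gives the corresponding reduction in mean squared error, roughly a factor of 2.4.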

9. 【2603.26610】Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

Link: https://arxiv.org/abs/2603.26610

Authors: Ruixing Zhang,Hanzhang Jiang,Leilei Sun,Liangzhe Han,Jibin Wang,Weifeng Lv

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Mobile devices continuously, generating massive volumes, understanding human mobility, devices continuously interact, Mobile devices

Comments:

Click to view abstract

Abstract:Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on the map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

10. 【2603.26599】VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Link: https://arxiv.org/abs/2603.26599

Authors: Zhaochong An,Orest Kupyn,Théo Uscidda,Andrea Colaco,Karan Ahuja,Serge Belongie,Mar Gonzalez-Franco,Marta Tintore Gazulla

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve impressive visual, Large-scale video diffusion, models achieve impressive, Large-scale video, achieve impressive

Comments: Project Page: [this https URL](https://zhaochongan.github.io/projects/VGGRPO)

Click to view abstract

Abstract:Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
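The "group relative" part of Group Relative Policy Optimization is usually implemented by normalizing each sample's reward against the other samples drawn for the same prompt. The sketch below shows that normalization step only. How VGGRPO combines its two rewards into one scalar is not stated in the abstract, so the input here is assumed to be an already-combined reward per sample, and the function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.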

11. 【2603.26597】From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

Link: https://arxiv.org/abs/2603.26597

Authors: Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Xilin Zhao,Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: Accepted at CVPR 2026

Click to view abstract

12. 【2603.26589】The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Link: https://arxiv.org/abs/2603.26589

Authors: Gillian Rosenberg,Skylar Stadhard,Bruce C. Hansen,Michelle R. Greene

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sufficient to learn, learn the full, full richness, scene understanding, distributional hypothesis

Comments: 7 figures, 5 tables

Click to view abstract

Abstract:What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.

13. 【2603.26588】From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Link: https://arxiv.org/abs/2603.26588

Authors: Dávid Pukanec,Tibor Kubík,Michal Španěl

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: artificially created incomplete, created incomplete teeth, present ToothCraft, trained on artificially, contextual generation

Comments: VISAPP 2026 Conference

Click to view abstract

Abstract:We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: this https URL
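The two metrics reported above, intersection over union and Chamfer Distance, can be sketched as follows. Chamfer Distance has several conventions in the literature (squared or unsquared distances, sums or means). This example assumes mean squared nearest-neighbour distances summed over both directions, which may differ from the paper's definition. The voxel-set representation for IoU is likewise an assumption.

```python
def iou(occupied_a, occupied_b):
    """IoU between two collections of occupied voxel indices."""
    a, b = set(occupied_a), set(occupied_b)
    union = a | b
    # Two empty shapes are treated as a perfect match.
    return len(a & b) / len(union) if union else 1.0

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance (mean squared nearest-neighbour distance)."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

The brute-force nearest-neighbour search is quadratic in the number of points. It is fine for illustration, but a spatial index would be the usual choice for dense dental meshes.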

14. 【2603.26586】MA-Bench: Towards Fine-grained Micro-Action Understanding

Link: https://arxiv.org/abs/2603.26586

Authors: Kun Li,Jihao Gu,Fei Wang,Zhiliang Wu,Hehe Fan,Dan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: this https URL

15. 【2603.26584】Scene Grounding In the Wild

Link: https://arxiv.org/abs/2603.26584

Authors: Tamir Cohen,Leo Segre,Shay Shomer-Chai,Shai Avidan,Hadar Averbuch-Elor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing accurate, imagery remains, computer vision, Google Earth Studio, remains a core

Comments: Project page at [this https URL](https://tau-vailab.github.io/SceneGround/)

Click to view abstract

Abstract:Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.

16. 【2603.26571】Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Link: https://arxiv.org/abs/2603.26571

Authors: Ziyue Zeng,Xun Su,Haoyuan Liu,Bingyu Lu,Yui Tatsumi,Hiroshi Watanabe

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Existing generative video, modules atop conventional, atop conventional codecs, Existing generative, post-hoc reconstruction modules

Comments: 9 pages, 3 figures

Click to view abstract

Abstract:Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies -- Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.

17. 【2603.26553】HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Link: https://arxiv.org/abs/2603.26553

Authors: Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: grounded gestures remains, significant advances, remains a challenge, semantically grounded gestures, producing holistic

Abstract:While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain crossmodal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
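
The contrastive objective described in this abstract can be sketched as follows. This is a minimal illustration in the spirit of contrastive flow matching, not the authors' loss: the linear interpolation path, the weight `lam`, and every function name are assumptions, and the abstract says nothing about how the negative term is weighted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_fm_loss(v_pred, x0, x1, x0_neg, x1_neg, lam=0.05):
    """Pull v_pred toward the matched target velocity, push it away from a mismatched one.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
    (x0_neg, x1_neg) stands in for a sample drawn under a mismatched condition.
    """
    target = [b - a for a, b in zip(x0, x1)]
    neg_target = [b - a for a, b in zip(x0_neg, x1_neg)]
    return sq_dist(v_pred, target) - lam * sq_dist(v_pred, neg_target)


# Toy 2-D example: the prediction matches the positive target exactly,
# so only the (negative-signed) repulsion term contributes.
loss = contrastive_fm_loss(
    v_pred=[1.0, 1.0],
    x0=[0.0, 0.0], x1=[1.0, 1.0],
    x0_neg=[0.0, 0.0], x1_neg=[-1.0, -1.0],
)
print(loss)
```

With `lam=0.0` the expression reduces to an ordinary flow-matching regression loss, which makes the contrastive term easy to ablate.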

18. 【2603.26551】Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Link: https://arxiv.org/abs/2603.26551

Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: modern computer vision, backbone networks play, networks play, play a central, central role

Notes: Submitted to International Journal of Computer Vision (IJCV); currently under minor revision

Abstract:Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (multiply-accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that can further improve upon its baseline's speed on both edge and desktop GPUs. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at this https URL.

19. 【2603.26546】AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing

Link: https://arxiv.org/abs/2603.26546

Authors: Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Liu, Yuan Liu, Ping Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: consistently demand massive, demand massive datasets, rare weather scenarios, learn rare weather, Generative video models

Abstract:Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.

20. 【2603.26541】OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

Link: https://arxiv.org/abs/2603.26541

Authors: Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex everyday environments, autonomous agents operating, everyday environments, Incremental open-vocabulary, essential for autonomous

Abstract:Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

21. 【2603.26528】Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation

Link: https://arxiv.org/abs/2603.26528

Authors: Imad Ali Shah, Jiarong Li, Ethan Delaney, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dimensionality poses, dimensionality poses challenges, sensing provides rich, scene understanding, poses challenges

Abstract:Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12 to 36 parameters compared to 51 to 22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.

22. 【2603.26509】Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays

Link: https://arxiv.org/abs/2603.26509

Authors: Martin Rath, Morteza Ghahremani, Yitong Li, Ashkan Taghipour, Marcus Makowski, Christian Wachinger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high radiation exposure, Computed tomography, substantial costs, anatomical details, radiation exposure

Abstract:Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving an 11.9% improvement in PSNR and an 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at this https URL.

23. 【2603.26486】ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Link: https://arxiv.org/abs/2603.26486

Authors: Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, Large vision-language, tend to hallucinate, inputs are corrupted, Large

Notes: 30 pages, 12 figures

Abstract:Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.

24. 【2603.26481】SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Link: https://arxiv.org/abs/2603.26481

Authors: Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic real world, real world, photorealistic and immersive, limits practical scalability, High-quality

Notes: CVPR 2026

Abstract:High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.

25. 【2603.26468】HyVIC: A Metric-Driven Spatio-Spectral Hyperspectral Image Compression Architecture Based on Variational Autoencoders

Link: https://arxiv.org/abs/2603.26468

Authors: Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: necessitates effective compression, hyperspectral data archives, remote sensing, necessitates effective, storage and transmission

Abstract:The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce a spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66 dB in terms of BD-PSNR. Based on our results, we offer insights and derive practical guidelines for future research in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at this https URL.

26. 【2603.26447】Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates

Link: https://arxiv.org/abs/2603.26447

Authors: Shaurjya Mandal, Nutan Sharma, John Galeotti

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: single images remains, inherent depth ambiguity, Human mesh recovery, remains challenging due, images remains challenging

Abstract:Human mesh recovery from single images remains challenging due to inherent depth ambiguity and limited generalization across domains. While recent methods combine regression and optimization approaches, they struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization. We propose a novel meta-learning framework that trains models to produce optimization-friendly initializations while incorporating uncertainty-aware adaptive updates during test-time refinement. Our approach introduces three key innovations: (1) a meta-learning strategy that simulates test-time optimization during training to learn better parameter initializations, (2) a selective parameter caching mechanism that identifies and freezes converged joints to reduce computational overhead, and (3) distribution-based adaptive updates that sample parameter changes from learned distributions, enabling robust exploration while quantifying uncertainty. Additionally, we employ stochastic approximation techniques to handle intractable gradients in complex loss landscapes. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance, reducing MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M compared to strong baselines. Our approach shows superior domain adaptation capabilities with minimal performance degradation across different environmental conditions, while providing meaningful uncertainty estimates that correlate with actual prediction errors. Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.
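
For readers unfamiliar with the metric quoted in this abstract, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch follows; the function name and the toy joints are illustrative assumptions. MPJPE is commonly reported in millimetres, though the abstract does not state the unit for its 10.3 and 8.0 reductions.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if not gt_joints or len(pred_joints) != len(gt_joints):
        raise ValueError("need two non-empty joint lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(gt_joints)


# Two toy joints: the first is predicted exactly, the second is off by a 3-4-5 triangle.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # 2.5
```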

27. 【2603.26444】Image-based Quantification of Postural Deviations on Patients with Cervical Dystonia: A Machine Learning Approach Using Synthetic Training Data

Link: https://arxiv.org/abs/2603.26444

Authors: Roland Stenger, Sebastian Löns, Nele Brügge, Feline Hamami, Alexander Münchau, Theresa Paulus, Anne Weissbach, Tatiana Usnich, Max Borsche, Martje G. Pauly, Lara M. Lange, Markus A. Hobert, Rebecca Herzog, Ana Luísa de Almeida Marcelino, Tina Mainka, Friederike Schumann, Lukas L. Goede, Johanna Reimer, Julienne Haas, Jos Becktepe, Alexander Baumann, Robin Wolke, Chi Wang Ip, Thorsten Odorfer, Daniel Zeller, Lisa Harder-Rauschenberger, John-Ih Lee, Philipp Albrecht, Tristan Kölsche, Joachim K. Krauss, Johanna M. Nagel, Joachim Runge, Johanna Doll-Lee, Simone Zittel, Kai Grimm, Pawel Tacik, André Lee, Tobias Bäumer, Sebastian Fudickar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Toronto Western Spasmodic, Western Spasmodic Torticollis, Toronto Western, Western Spasmodic, Cervical dystonia

Abstract:Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on subjective clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which requires expertise and shows low inter-rater reliability for some items of the score. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system's performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts using a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars. By leveraging synthetic training data to bridge the clinical data gap, this model successfully generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.

28. 【2603.26425】CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Link: https://arxiv.org/abs/2603.26425

Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: parallel processing capabilities, Recent research, high parallel processing, processing capabilities, architectures has predominantly

Notes: Accepted at CVPR Findings 2026

Abstract:Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations to the same degree, so models for them benefit from a design philosophy that balances the number of operations (MACs) against hardware-efficient execution, measured as MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware-efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware-efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at this https URL.
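
The MAC arithmetic behind the two convolution modifications named in this abstract can be made concrete. The sketch below uses the standard MAC count for a 2D convolution; the layer shape (56x56 output, 64 channels) and the function names are illustrative assumptions, not values from the paper.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel_size, groups=1):
    """Multiply-accumulate count of one 2D convolution layer (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output element sums over (c_in / groups) * k * k products.
    macs_per_output = (c_in // groups) * kernel_size * kernel_size
    return h_out * w_out * c_out * macs_per_output


def macs_per_second(macs, latency_seconds):
    """Hardware-efficiency proxy (MACpS): operations executed per second of measured latency."""
    return macs / latency_seconds


standard = conv2d_macs(56, 56, 64, 64, kernel_size=3)
grouped = conv2d_macs(56, 56, 64, 64, kernel_size=3, groups=4)
small_kernel = conv2d_macs(56, 56, 64, 64, kernel_size=2)

print(standard)      # 115605504
print(grouped)       # 28901376 (4x fewer: each filter sees 1/4 of the input channels)
print(small_kernel)  # 51380224 (4/9 of the 3x3 cost)
```

Fewer MACs only translate into lower latency if MACs per second stay high, which is the hardware-efficiency question the abstract raises. `macs_per_second` needs a measured latency, and none is invented here.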

29. 【2603.26400】SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

Link: https://arxiv.org/abs/2603.26400

Authors: Le Ma, Thiago Freitas dos Santos, Nadia Magnenat-Thalmann, Katarzyna Wac

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: expertise remains confined, expert-led skill assessment, proficiency development relies, difficult to scale, relies on expert-led

Abstract:In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, and which remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.

30. 【2603.26385】Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

Link: https://arxiv.org/abs/2603.26385

Authors: I-Hsiang Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image restoration aims, Image restoration, recover high quality, Image Quality Assessment, quality image restoration

Notes: Accepted by CVPR 2026; Project Page: https://restore-assess-repeat.github.io

Abstract:Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown, and composite degradations, thereby establishing a new state of the art.

31. 【2603.26365】Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Link: https://arxiv.org/abs/2603.26365

Authors: Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Abstract:Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.
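
To make "inter-frame residuals" and "retention ratio" from this abstract concrete, here is a deliberately simple stand-in: a fixed top-k rule that keeps the tokens whose features changed most since the previous frame. It is exactly the kind of heuristic the abstract contrasts with SCORE's learned policy, not SCORE itself, and the function names and toy features are assumptions.

```python
import math


def residual_scores(prev_tokens, cur_tokens):
    """Per-token 'surprise': L2 norm of the change between consecutive frames."""
    return [math.dist(p, c) for p, c in zip(prev_tokens, cur_tokens)]


def keep_top_tokens(scores, retention_ratio):
    """Indices of the highest-scoring tokens, keeping at least one, in original order."""
    k = max(1, round(len(scores) * retention_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four toy tokens with 2-D features; tokens 1 and 2 change between frames.
prev_frame = [[0, 0], [1, 1], [2, 2], [3, 3]]
cur_frame = [[0, 0], [1, 2], [5, 6], [3, 3]]

scores = residual_scores(prev_frame, cur_frame)
print(scores)                        # [0.0, 1.0, 5.0, 0.0]
print(keep_top_tokens(scores, 0.5))  # [1, 2]
```

A learned policy would replace the fixed top-k rule with a network that takes such residuals as part of its state, which is what the abstract describes.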

32. 【2603.26362】HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Link: https://arxiv.org/abs/2603.26362

Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VR-based human-AI interaction, chip manufacturing, robot-assisted surgery, VR-based human-AI, articulation of human

Notes: Accepted in CVPR 2026; Project page, code, and dataset: https://kcsayem.github.io/handvqa/

Abstract:Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

33. 【2603.26357】MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Link: https://arxiv.org/abs/2603.26357

Authors: Quan Dao, Dimitris Metaxas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flow-matching models due, strong performance compared, Diffusion Transformers, convolutional UNets, flow-matching models

Notes: Accepted at CVPR 2026

Abstract:Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at this https URL.

34. 【2603.26356】From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter

Link: https://arxiv.org/abs/2603.26356

Authors: Zhenghao Xu(1), Mengning Yang(1) ((1) School of Big Data and Software Engineering, Chongqing University, Chongqing, China)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hand-drawn plot images, modern data visualization, plot images, reference plot images, standard plot images

Abstract:As plots play a critical role in modern data visualization and analysis, Plot2API is launched to help non-experts and beginners create their desired plots by directly recommending graphical APIs from reference plot images by neural networks. However, previous works on Plot2API have primarily focused on the recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To facilitate non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendations for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter that allows for the training and storage of separate adapters rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to reduce the number of fine-tuning parameters further. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.

35. 【2603.26354】Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection

链接：https://arxiv.org/abs/2603.26354

作者：Nazia Aslam,Abhisek Ray,Thomas B. Moeslund,Kamal Nasrollahi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：safety critical environments, Data Protection Regulation, General Data Protection, systems are increasingly, increasingly deployed

备注： 10 pages, CVPR conference

Abstract:Video anomaly detection (VAD) systems are increasingly deployed in safety-critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What's Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth-based and depth-based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking-based methods, along with Pareto analysis, to characterize the resulting trade-off between privacy and utility. From the non-dominated frontier, we identify sweet-spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.
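
The Pareto analysis mentioned in this abstract can be sketched as a simple non-dominated filter over (privacy leakage, utility) pairs. This is only an illustration of the general technique: the `Config` type, the configuration names, and all numbers below are made up and are not results from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str            # hypothetical label for a minimization configuration
    privacy_leak: float  # lower is better (e.g. privacy-model accuracy)
    utility: float       # higher is better (e.g. anomaly-detection AUC)


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.privacy_leak <= b.privacy_leak and a.utility >= b.utility
    better = a.privacy_leak < b.privacy_leak or a.utility > b.utility
    return no_worse and better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers purely for illustration.
candidates = [
    Config("raw", 0.90, 0.85),
    Config("blur", 0.60, 0.83),
    Config("blur+downsample", 0.40, 0.80),
    Config("heavy-mask", 0.45, 0.70),
]
front = pareto_front(candidates)
front_names = [c.name for c in front]
```

Here `heavy-mask` drops out because `blur+downsample` leaks less and detects better; the remaining three points form the frontier from which an operating point would be picked.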

36. 【2603.26351】DuSCN-FusionNet: An Interpretable Dual-Channel Structural Covariance Fusion Framework for ADHD Classification Using Structural MRI

链接：https://arxiv.org/abs/2603.26351

作者：Qurat Ul Ain,Alptekin Temizel,Soyiba Jawed

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Attention Deficit Hyperactivity, Deficit Hyperactivity Disorder, prevalent neurodevelopmental condition, Attention Deficit, Hyperactivity Disorder

备注： 5 pages, 5 figures

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a highly prevalent neurodevelopmental condition; however, its neurobiological diagnosis remains challenging due to the lack of reliable imaging-based biomarkers, particularly anatomical markers. Structural MRI (sMRI) provides a non-invasive modality for investigating brain alterations associated with ADHD; nevertheless, most deep learning approaches function as black-box systems, limiting clinical trust and interpretability. In this work, we propose DuSCN-FusionNet, an interpretable sMRI-based framework for ADHD classification that leverages dual-channel Structural Covariance Networks (SCNs) to capture inter-regional morphological relationships. ROI-wise mean intensity and intra-regional variability descriptors are used to construct intensity-based and heterogeneity-based SCNs, which are processed through an SCN-CNN encoder. In parallel, auxiliary ROI-wise variability features and global statistical descriptors are integrated via late-stage fusion to enhance performance. The model is evaluated using stratified 10-fold cross-validation with a 5-seed ensemble strategy, achieving a mean balanced accuracy of 80.59% and an AUC of 0.778 on the Peking University site of the ADHD-200 dataset. DuSCN-FusionNet further achieves precision, recall, and F1-scores of 81.66%, 80.59%, and 80.27%, respectively. Moreover, Grad-CAM is adapted to the SCN domain to derive ROI-level importance scores, enabling the identification of structurally relevant brain regions as potential biomarkers.

37. 【2603.26348】Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

链接：https://arxiv.org/abs/2603.26348

作者：Shuai Lv,Chang Liu,Feng Tang,Yujie Yuan,Aojun Zhou,Kui Zhang,Xi Yang,Yangqiu Song

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Large Language Models, Multimodal Large Language, Large Language, outputs grow longer, recurring failure mode

备注：

Abstract:Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at this https URL.

38. 【2603.26341】HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network

链接：https://arxiv.org/abs/2603.26341

作者：Mingyu Zhang,Zixu Li,Zhiwei Chen,Zhiheng Fu,Xiaowei Zhu,Jiajia Nie,Yinwei Wei,Yupeng Hu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：image retrieval paradigm, challenging image retrieval, Composed Image Retrieval, Image Retrieval, retrieval paradigm

备注： Accepted by ICASSP 2026

Abstract:Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. Addressing this limitation is difficult because of two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus raising the performance ceiling of CIR models in complex scenarios. HINT achieves the best performance on all metrics across two CIR benchmark datasets. Codes are available at this https URL.

39. 【2603.26336】From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition

链接：https://arxiv.org/abs/2603.26336

作者：Nazia Aslam,Abhisek Ray,Joakim Bruslund Haurum,Lukas Esterle,Kamal Nasrollahi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Recent advances, significantly improved video, improved video understanding, advances in large-scale, significantly improved

备注： 10 pages, CVPR paper

Abstract:Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.
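
The tubelet selection step described above can be sketched as follows. The abstract only says that the two CLS attention distributions are contrasted; scoring each tubelet as the simple difference between action attention and privacy attention is an assumption made here for illustration, and the function name and attention values are hypothetical.

```python
def select_tubelets(action_attn, privacy_attn, k):
    """Return the indices of the k tubelets with the highest score.

    Assumed scoring rule (not confirmed by the abstract): attention from
    the action CLS token minus attention from the privacy CLS token.
    """
    scores = [a - p for a, p in zip(action_attn, privacy_attn)]
    # Rank tubelet indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top-k and return them in their original order.
    return sorted(ranked[:k])


# Hypothetical attention weights over five tubelets.
action = [0.30, 0.05, 0.25, 0.10, 0.30]
privacy = [0.05, 0.40, 0.10, 0.40, 0.05]
kept = select_tubelets(action, privacy, k=3)
```

In this toy input, tubelets 1 and 3 receive most of the privacy attention and are pruned, while tubelets 0, 2 and 4 are kept for action recognition.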

40. 【2603.26330】Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

链接：https://arxiv.org/abs/2603.26330

作者：Yiming Ren,Yujiu Yang,Junjie Wang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：visual instruction data, degrading reasoning performance, persistent reasoning tax, improves perceptual capabilities, Supervised fine-tuning

备注：

Abstract:Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.

41. 【2603.26328】Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization

链接：https://arxiv.org/abs/2603.26328

作者：Zidong Zhao,Yihao Huang,Qing Guo,Tianlin Li,Anran Li,Kailong Wang,Jin Song Dong,Geguang Pu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：convenient image creation, third-party platforms increasingly, platforms increasingly integrate, increasingly integrate multiple, multiple model APIs

备注： Accepted to CVPR 2026 (Findings)

Abstract:As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as "corgi" and "bagel") are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.

42. 【2603.26320】DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

链接：https://arxiv.org/abs/2603.26320

作者：Jiayi Chen,Wenxuan Song,Shuai Chen,Jingbo Wang,Zhijun Li,Haoang Li

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：remain fundamentally limited, paradigms remain fundamentally, discrete tokenization scheme, existing decoding paradigms, decoding paradigms remain

备注：

Abstract:Vision-Language-Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at this https URL.

43. 【2603.26317】Label-Free Cross-Task LoRA Merging with Null-Space Compression

链接：https://arxiv.org/abs/2603.26317

作者：Wonyoung Lee,Wooseong Jeong,Kuk-Jin Yoon

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：joint multi-task training, combines independently fine-tuned, independently fine-tuned checkpoints, merging combines independently, multi-task training

备注： Accepted at CVPR 2026

Abstract:Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation models, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor $A$ in $\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
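
For readers unfamiliar with the notation, the sketch below shows what a LoRA update of the form ΔW = BA looks like and how several such updates could be combined with per-task merge weights. The weighted sum is a generic merging baseline assumed here for illustration; it is not the NSC procedure, which the abstract does not specify in detail. All matrices and weights are toy values.

```python
def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_delta(B, A):
    """LoRA weight update: B is (d_out x r), A is (r x d_in)."""
    return matmul(B, A)


def merge_deltas(deltas, weights):
    """Weighted sum of per-task updates (a generic baseline, assumed here)."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for w, d in zip(weights, deltas))
             for j in range(cols)] for i in range(rows)]


# Two toy rank-1 adapters for a 2x3 weight matrix.
delta_1 = lora_delta([[1.0], [0.0]], [[1.0, 2.0, 0.0]])
delta_2 = lora_delta([[0.0], [1.0]], [[0.0, 1.0, 1.0]])

# Equal merge weights; a method such as NSC would choose these instead.
merged = merge_deltas([delta_1, delta_2], weights=[0.5, 0.5])
```

The point of a label-free method is that the `weights` list would be set from properties of the adapters themselves rather than from labeled validation data.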

44. 【2603.26316】SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

链接：https://arxiv.org/abs/2603.26316

作者：Cai Selvas-Sala,Lei Kang,Lluis Gomez

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：CLIP become integral, remove sensitive information, downstream systems, information is critical, integral to downstream

备注： Accepted to CVPR 2026. Project page: [this http URL](http://cvc-mmu.github.io/salmubench)

Abstract:As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

45. 【2603.26299】Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

链接：https://arxiv.org/abs/2603.26299

作者：Wooseong Jeong,Wonyoung Lee,Kuk-Jin Yoon

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：multiple Low-Rank Adaptation, constructing general-purpose systems, Low-Rank Adaptation, update directions span, Merging multiple Low-Rank

备注： Accepted at CVPR 2026

Abstract:Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model's ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.

46. 【2603.26285】PhysVid: Physics Aware Local Conditioning for Generative Video Models

链接：https://arxiv.org/abs/2603.26285

作者：Saurabh Pathak,Elahe Arani,Mykola Pechenizkiy,Bahram Zonooz

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：achieve high visual, high visual fidelity, basic physical principles, models achieve high, limiting reliability

备注： Accepted for CVPR 2026

Abstract:Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by approximately 33% over baseline video generators, and by up to approximately 8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.

47. 【2603.26266】GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

链接：https://arxiv.org/abs/2603.26266

作者：Rui Xie,Zhi Gao,Chenrui Shi,Zirui Shang,Lu Chen,Qing Li

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：Large vision-language models, strong general capabilities, Large vision-language, endowed GUI agents, understanding and interaction

备注： 28 pages, 8 figures, 7 tables

Abstract:Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge, which is injected into the agent's corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE's generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.

48. 【2603.26263】DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation

链接：https://arxiv.org/abs/2603.26263

作者：Tomoya Miyawaki,Kazuto Nakashima,Yumi Iwashita,Ryo Kurazume

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：LiDAR-based semantic segmentation, autonomous mobile robots, LiDAR point clouds, LiDAR-based semantic, mobile robots

备注： ICRA 2026

Abstract:LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at this https URL.

49. 【2603.26262】GLASS: Geometry-aware Local Alignment and Structure Synchronization Network for 2D-3D Registration

链接：https://arxiv.org/abs/2603.26262

作者：Zhixin Cheng,Jiacheng Deng,Xinjun Li,Bohao Liao,Li Liu,Xiaotian Yin,Baoqun Yin,Tianzhu Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：extracting patch-level correspondences, methods typically follow, extracting patch-level, typically follow, Local Geometry Enhancement

备注： Accepted by IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Image-to-point cloud registration methods typically follow a coarse-to-fine pipeline, extracting patch-level correspondences and refining them into dense pixel-to-point matches. However, in scenes with repetitive patterns, images often lack sufficient 3D structural cues and alignment with point clouds, leading to incorrect matches. Moreover, prior methods usually overlook structural consistency, limiting the full exploitation of correspondences. To address these issues, we propose two novel modules: the Local Geometry Enhancement (LGE) module and the Graph Distribution Consistency (GDC) module. LGE enhances both image and point cloud features with normal vectors, injecting geometric structure into image features to reduce mismatches. GDC constructs a graph from matched points to update features and explicitly constrain similarity distributions. Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance in image-to-point cloud registration.

50. 【2603.26260】GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

链接：https://arxiv.org/abs/2603.26260

作者：Xujing Tao,Chuxin Wang,Yubo Ai,Zhixin Cheng,Zhuoyuan Li,Liangsheng Liu,Yujia Chen,Xinjun Li,Qiao Li,Wenfei Yang,Tianzhu Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：segment arbitrary categories, training set, aims to segment, segment arbitrary, arbitrary categories

备注： Accepted to CVPR 2026

Abstract:Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

51. 【2603.26258】ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

链接：https://arxiv.org/abs/2603.26258

作者：David Hagerman,Roman Naeem,Erik Brorsson,Fredrik Kahl,Lennart Svensson

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：dense feature extraction, efficient dense feature, vision transformer, feature extraction, transformer for efficient

备注：

Abstract:We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.

52. 【2603.26250】Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment

链接：https://arxiv.org/abs/2603.26250

作者：Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Autonomous tree pruning, unmanned aerial vehicles, safety-critical real-world task, Jetson Orin Super, NVIDIA Jetson Orin

备注：

Abstract:Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based stereo matcher - on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2m operating range. The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.
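
Two pieces of arithmetic sit behind the numbers in this abstract: converting frames per second to per-frame latency, and converting stereo disparity to metric depth with the standard pinhole relation Z = f * B / d. The sketch below uses the FPS figures reported in the abstract; the focal length, baseline, and disparity values are assumptions chosen for illustration and are not stated in the abstract.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / fps


def depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Standard stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Approximate Jetson frame rates as reported in the abstract.
reported_fps = {
    "DEFOM-Stereo ViT-S": 2.2,
    "DEFOM-PrunePlus": 3.3,
    "DEFOM-PruneStereo": 6.9,
    "DEFOM-PruneNano": 8.5,
}
latencies = {name: latency_ms(fps) for name, fps in reported_fps.items()}

# Assumed camera parameters (hypothetical, for illustration only).
FOCAL_PX = 700.0     # focal length in pixels
BASELINE_M = 0.063   # stereo baseline in metres
z = depth_m(FOCAL_PX, BASELINE_M, disparity_px=22.05)
```

At 2.2 FPS the latency works out to about 455 ms, consistent with the roughly 450 ms quoted for ViT-S, and under the assumed camera parameters a disparity of about 22 px corresponds to the 2 m operating range.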

53. 【2603.26211】Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Link: https://arxiv.org/abs/2603.26211

Authors: Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: dominated multimodal understanding, long dominated multimodal, graphical user interface, GUI grounding, GUI

Notes: Accepted to CVPR 2026

Abstract:Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

54. 【2603.26206】4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation

Link: https://arxiv.org/abs/2603.26206

Authors: Ningyuan Huang, Zhiheng Li, Zheng Fang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: loop closure detection, Place recognition, localization in robotics, crucial for loop, loop closure

Notes: Accepted by ICRA 2026

Abstract:Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.

55. 【2603.26197】SAFT: Sensitivity-Aware Filtering and Transmission for Adaptive 3D Point Cloud Communication over Wireless Channels

Link: https://arxiv.org/abs/2603.26197

Authors: Huda Adam Sirag Mekki, Hui Yuan, Mohanad M. G. Hassan, Zejia Chen, Guanghui Zhang

Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reliable transmission, point clouds, due to time-varying, clouds over wireless, challenging due

Notes:

Abstract:Reliable transmission of 3D point clouds over wireless channels is challenging due to time-varying signal-to-noise ratio (SNR) and limited bandwidth. This paper introduces sensitivity-aware filtering and transmission (SAFT), a learned transmission framework that integrates a Point-BERT-inspired encoder, a sensitivity-guided token filtering (STF) unit, a quantization block, and an SNR-aware decoder for adaptive reconstruction. Specifically, the STF module assigns token-wise importance scores based on the reconstruction sensitivity of each token under channel perturbation. We further employ a training-only symbol-usage penalty to stabilize the discrete representation, without affecting the transmitted payload. Experiments on ShapeNet, ModelNet40, and 8iVFB show that SAFT improves geometric fidelity (D1/D2 PSNR) compared with a separate source-channel coding pipeline (G-PCC combined with LDPC and QAM) and existing learned baselines, with the largest gains observed in low-SNR regimes, highlighting improved robustness under limited bandwidth.

56. 【2603.26193】MemCam: Memory-Augmented Camera Control for Consistent Video Generation

Link: https://arxiv.org/abs/2603.26193

Authors: Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan, Jiacheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Interactive video generation, video generation, Interactive video, significant potential, video

Notes: 6 pages, 3 figures, 3 tables, accepted by IJCNN 2026

Abstract:Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.

57. 【2603.26192】HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning

Link: https://arxiv.org/abs/2603.26192

Authors: Xuerui Zhang, Xuehao Wang, Zhan Zhuang, Linglan Zhao, Ziyue Li, Xinmin Zhang, Zhihuan Song, Yu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Lifelong learning aims, acquired from previous, LHL, preserve knowledge acquired, Lifelong learning

Notes:

Abstract:Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.

58. 【2603.26190】Dual-Stage Invariant Continual Learning under Extreme Visual Sparsity

Link: https://arxiv.org/abs/2603.26190

Authors: Rangya Zhang, Jiaping Xiao, Lu Bai, Yuhang Zhang, Mir Feroskhan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: maintain stable adaptation, balanced visual conditions, existing methods implicitly, methods implicitly assume, non-stationary environments

Notes:

Abstract:Continual learning seeks to maintain stable adaptation under non-stationary environments, yet this problem becomes particularly challenging in object detection, where most existing methods implicitly assume relatively balanced visual conditions. In extreme-sparsity regimes, such as those observed in space-based resident space object (RSO) detection scenarios, foreground signals are overwhelmingly dominated by background observations. Under such conditions, we analytically demonstrate that background-driven gradients destabilize the feature backbone during sequential domain shifts, causing progressive representation drift. This exposes a structural limitation of continual learning approaches relying solely on output-level distillation, as they fail to preserve intermediate representation stability. To address this, we propose a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions, respectively, thereby suppressing error propagation at its source while maintaining adaptability. Furthermore, to regulate gradient statistics under severe imbalance, we introduce a sparsity-aware data conditioning strategy combining patch-based sampling and distribution-aware augmentation. Experiments on a high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.

59. 【2603.26188】OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

Link: https://arxiv.org/abs/2603.26188

Authors: Rui Wang, Huisi Wu, Jing Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assessing cardiac function, Accurate and temporally, temporally consistent segmentation, cardiac function, temporally consistent

Notes:

Abstract:Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at this https URL.

60. 【2603.26186】Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI

Link: https://arxiv.org/abs/2603.26186

Authors: Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Cardiac MRI late, MRI late gadolinium, Cardiac MRI, late gadolinium enhancement, enables non-invasive identification

Notes: 16 pages, 3 figures, 3 tables

Abstract:Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by the clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) an initial LA cavity pre-learning model, 2) a dual-task model that further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning for precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In preliminary results on validation LGE volumes from the public LASCARQS dataset under 5-fold cross-validation, LA segmentation reached a Dice score of 0.94, and LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a one-stage scar segmentation alone (0.49, 13.02 mm, and 1.96 mm, respectively). By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.

61. 【2603.26183】DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds

Link: https://arxiv.org/abs/2603.26183

Authors: Pan Zhao, Hui Yuan, Chang Sun, Chongzhen Tian, Raouf Hamzaoui, Sam Kwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing post-decoding quality, Existing post-decoding, dynamic point clouds, point clouds, frame independently

Notes:

Abstract:Existing post-decoding quality enhancement methods for point clouds are designed for static data and typically process each frame independently. As a result, they cannot effectively exploit the spatiotemporal correlations present in dynamic point clouds. We propose a unified geometry and attribute enhancement framework (DUGAE) for G-PCC compressed dynamic point clouds that explicitly exploits inter-frame spatiotemporal correlations in both geometry and attributes. First, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal information. Then, a detail-aware k-nearest neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details. Finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines attributes by modeling complex spatiotemporal correlations. On seven dynamic point clouds from the 8iVFB v2, Owlii, and MVUB datasets, DUGAE significantly enhanced the performance of the latest G-PCC geometry-based solid content test model (GeS-TM v10). For geometry (D1), it achieved an average BD-PSNR gain of 11.03 dB and a 93.95% BD-bitrate reduction. For the luma component, it achieved a 4.23 dB BD-PSNR gain with a 66.61% BD-bitrate reduction. DUGAE also improved perceptual quality (as measured by PCQM) and outperformed V-PCC. Our source code will be released on GitHub at: this https URL

62. 【2603.26181】GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Link: https://arxiv.org/abs/2603.26181

Authors: Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon, Suyong Yeon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: powerful paradigm, splatting has emerged, fundamentally fails, Gaussian splatting, glass panels

Notes: CVPR 2026. Project page: https://youngju-na.github.io/GLINT

Abstract:While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

63. 【2603.26179】Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Link: https://arxiv.org/abs/2603.26179

Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection focus primarily, Recent advances, leveraging contrastive learning, open-vocabulary object detection, object detection focus

Notes:

Abstract:Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: this https URL.

64. 【2603.26174】CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Link: https://arxiv.org/abs/2603.26174

Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Instruction-based multimodal image, made rapid progress, recently made rapid, Instruction-based multimodal, Multimodal Large Language

Notes: Accepted by CVPR 2026

Abstract:Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

65. 【2603.26173】ComVi: Context-Aware Optimized Comment Display in Video Playback

Link: https://arxiv.org/abs/2603.26173

Authors: Minsun Kim, Dawon Lee, Junyong Noh

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)

Keywords: general video-sharing platforms, general video-sharing, video-sharing platforms, displayed independently, video playback

Notes: To appear in Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)

Abstract:On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.

66. 【2603.26168】Provably Contractive and High-Quality Denoisers for Convergent Restoration

Link: https://arxiv.org/abs/2603.26168

Authors: Shubhi Shukla, Pravin Nair

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: degraded measurements, domains like surveillance, medical imaging, recovery of clean, Image restoration

Notes:

Abstract:Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques, with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $\|\delta\|\le\varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Codes and pretrained models are available at this https URL
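
The stability guarantee in this abstract can be illustrated with a small sketch. It does not use the authors' network: soft-thresholding (the proximal operator of the L1 norm, a well-known non-expansive map) stands in as a toy "proximal layer", and the vector size, threshold, and perturbation scale are arbitrary assumptions. The check is only that the output change never exceeds the input perturbation norm.

```python
import math
import random


def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1, applied elementwise (a toy stand-in layer)."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in x]


def l2_norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def output_change(f, x, delta):
    """Return ||f(x + delta) - f(x)||_2 for a vector-to-vector map f."""
    y0 = f(x)
    y1 = f([a + b for a, b in zip(x, delta)])
    return l2_norm([a - b for a, b in zip(y1, y0)])


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]         # arbitrary test signal
delta = [rng.uniform(-0.05, 0.05) for _ in range(64)]   # arbitrary small perturbation

eps = l2_norm(delta)
change = output_change(lambda v: soft_threshold(v, 0.1), x, delta)
print(f"input perturbation: {eps:.4f}, output change: {change:.4f}")
```

Soft-thresholding is only non-expansive (Lipschitz constant at most 1), so this demonstrates the bound on output change rather than strict contraction; a learned denoiser would need its own certified Lipschitz bound.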

67. 【2603.26167】Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication

Link: https://arxiv.org/abs/2603.26167

Authors: Yi Zhang, Hongbo Huang, Liang-Jie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: models generate high-quality, generate high-quality images, Diffusion models generate, violation and disinformation, models generate

Notes: Accepted by CVPR 2026 Findings

Abstract:Diffusion models generate high-quality images but pose serious risks like copyright violation and disinformation. Watermarking is a key defense for tracing and authenticating AI-generated content. However, existing methods rely on threshold-based detection, which only supports fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). To address this problem, in this paper, we propose Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy communication channel and enables both robust tracing and exact bit recovery. Our method embeds watermarks in the initial Gaussian noise without fine-tuning or quality loss. We identify two types of channel interference, namely local bit flips and global stochastic distortions, and design a cascaded defense combining error-correcting codes and majority voting. This ensures reliable end-to-end transmission of semantic payloads. Experiments across three Stable Diffusion variants and seven perturbation types show that Gaussian Shannon achieves state-of-the-art bit-level accuracy while maintaining a high true positive rate, enabling trustworthy rights attribution in real-world deployment. The source code has been made available at: this https URL
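
As a rough illustration of why redundancy plus majority voting allows exact bit recovery over a noisy channel, here is a minimal sketch using a plain repetition code. This is not the paper's scheme: the repetition code, the 5% flip probability, and every function name are assumptions made for illustration, and the actual method is described as cascading error-correcting codes with majority voting.

```python
import random


def encode_repetition(bits, copies=5):
    """Repeat every payload bit `copies` times (a simple repetition code)."""
    return [b for b in bits for _ in range(copies)]


def flip_bits(codeword, flip_prob, rng):
    """Simulate a noisy channel that flips each bit independently."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in codeword]


def decode_majority(received, copies=5):
    """Recover each payload bit by majority vote over its copies."""
    decoded = []
    for i in range(0, len(received), copies):
        block = received[i:i + copies]
        decoded.append(1 if sum(block) * 2 > len(block) else 0)
    return decoded


payload = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical watermark payload
rng = random.Random(0)
noisy = flip_bits(encode_repetition(payload), flip_prob=0.05, rng=rng)
recovered = decode_majority(noisy)
print("payload recovered exactly:", recovered == payload)
```

With five copies per bit, up to two flips inside any block are corrected. Real error-correcting codes achieve comparable protection with far less redundancy, which presumably motivates the cascaded design.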

68. 【2603.26154】IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios

Link: https://arxiv.org/abs/2603.26154

Authors: Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei, Xinlei He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: creating malicious content, significant concern, rapid advancement, creating malicious, malicious content

Notes:

Abstract:With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods' robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model and cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.

69. 【2603.26145】Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT

Link: https://arxiv.org/abs/2603.26145

Authors: Shuhei Tsuyuki, Reda Bensaid, Jérémy Morlier, Mathieu Léonardon, Naoya Onizawa, Vincent Gripon, Takahiro Hanyu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning research, adaptable deep learning, deep learning models, highly efficient models, deep learning

Notes:

Abstract:Efficient and adaptable deep learning models are an important area of deep learning research, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. This method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark, compared to the ResNet12 baseline, while reducing by 69% the number of parameters and by 88% the computational complexity of the model, in FLOPs. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that the dynamic energy consumption is reduced by 37% with a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.

70. 【2603.26138】PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion

Link: https://arxiv.org/abs/2603.26138

Authors: Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Efficient data selection, minimizing annotation requirements, deep neural networks, Efficient data, data selection

Notes: Published in TMLR (Featured Certification). arXiv admin note: substantial text overlap with [arXiv:2501.01118](https://arxiv.org/abs/2501.01118)

Abstract:Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training. PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.

71. 【2603.26134】InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

Link: https://arxiv.org/abs/2603.26134

Authors: Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstruct high-resolution frames, seeks to reconstruct, low-resolution inputs, reconstruct high-resolution, high-resolution frames

Comments: 12 pages, 7 figures

Abstract:Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.

72. 【2603.26128】TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

Link: https://arxiv.org/abs/2603.26128

Authors: Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Tree of Life, Accurately generating images, subtle visual traits, Life is difficult, Accurately generating

Abstract:Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

73. 【2603.26127】Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Link: https://arxiv.org/abs/2603.26127

Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Self-supervised Vision Transformers, Self-supervised Vision, Vision Transformers, DINO show, typically observed

Comments: Computer Vision and Pattern Recognition (CVPR) 2026

74. 【2603.26126】Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Link: https://arxiv.org/abs/2603.26126

Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Verifiable Rewards, improving final answer, final answer correctness, Recent advances, strengthening visual grounding

Abstract:Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

75. 【2603.26122】SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

Link: https://arxiv.org/abs/2603.26122

Authors: Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, training data sparsity, Visual Question Answering, advancements in Large

Abstract:While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, a +10% Cohen's Kappa improvement.

76. 【2603.26109】SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Link: https://arxiv.org/abs/2603.26109

Authors: Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, Bingye Peng, Qiwei Liang, Boyang Cai, Xiaochun Mai, Qiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leveraging text prompts, Open-vocabulary object detection, Open-vocabulary object, text prompts, open world

Comments: Accepted by CVPR 2026

Abstract:Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision--language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

77. 【2603.26108】Accurate Precipitation Forecast by Efficiently Learning from Massive Atmospheric Variables and Unbalanced Distribution

Link: https://arxiv.org/abs/2603.26108

Authors: Shuangliang Li, Siwei Li, Li Li, Weijie Zou, Jie Yang, Maolin Zhang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: 0-24 hours, public safety, socioeconomic activities, activities and public, precipitation forecasting

Abstract:Short-term (0-24 hours) precipitation forecasting is highly valuable to socioeconomic activities and public safety. However, the highly complex evolution patterns of precipitation events, the extreme imbalance between precipitation and non-precipitation samples, and the inability of existing models to efficiently and effectively utilize large volumes of multi-source atmospheric observation data hinder improvements in precipitation forecasting accuracy and computational efficiency. To address the above challenges, this study developed a novel forecasting model capable of effectively and efficiently utilizing massive atmospheric observations by automatically extracting and iteratively predicting the latent features strongly associated with precipitation evolution. Furthermore, this study introduces a 'WMCE' loss function, designed to accurately discriminate extremely scarce precipitation events while precisely predicting their intensity values. Extensive experiments on two datasets demonstrate that our proposed model substantially and consistently outperforms all prevalent baselines in both accuracy and efficiency. Moreover, the proposed forecasting model substantially lowers the computational cost required to obtain valuable predictions compared to existing approaches, thereby positioning it as a milestone for efficient and practical precipitation forecasting.

78. 【2603.26096】AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation

Link: https://arxiv.org/abs/2603.26096

Authors: Hyeongyu Kim, Geonhui Han, Dosik Hwang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mitigate performance degradation, updating model parameters, aims to mitigate, parameters during inference, mitigate performance

Comments: Accepted at CVPR 2026

Abstract:Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation.
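
To make "shift their response threshold and modulate gradient sensitivity" concrete, here is a minimal sketch of a ReLU with two extra scalars. The abstract does not give AcTTA's actual parameterization, so the `shift` and `slope` names and the exact functional form are assumptions made purely for illustration.

```python
def adaptive_relu(x, shift=0.0, slope=1.0):
    """ReLU with a movable threshold and a scalable positive-side gradient.

    With shift=0 and slope=1 this is the ordinary ReLU. In a test-time
    adaptation setting, only `shift` and `slope` would be updated while
    the network weights stay frozen. Hypothetical form, not the paper's.
    """
    return slope * max(x - shift, 0.0)


def apply_activation(values, shift=0.0, slope=1.0):
    """Apply the parameterized activation element-wise to a list of floats."""
    return [adaptive_relu(v, shift, slope) for v in values]
```

Raising `shift` suppresses weak activations that may be corruption noise, while `slope` rescales how strongly the surviving activations respond.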

79. 【2603.26092】CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

Link: https://arxiv.org/abs/2603.26092

Authors: Youngjun Song, Hyeongyu Kim, Dosik Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: enables real-time adaptation, Test-Time Adaptation, real-time adaptation, enables real-time, off-line retraining

Comments: Accepted at CVPR 2026

Abstract:Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without off-line retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: Our framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

80. 【2603.26088】Learnable Instance Attention Filtering for Adaptive Detector Distillation

Link: https://arxiv.org/abs/2603.26088

Authors: Chen Liu, Qizhen Lan, Zhicheng Ding, Xinyu Chu, Qing Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve higher performance, grow increasingly complex, deep vision models, vision models grow, models grow increasingly

Abstract:As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.

81. 【2603.26081】Experimental study on surveillance video-based indoor occupancy measurement with occupant-centric control

Link: https://arxiv.org/abs/2603.26081

Authors: Irfan Qaisar, Kailai Sun, Qingshan Jia, Qianchuan Zhao

Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate occupancy information, closed-loop occupant-centric control, information is essential, essential for closed-loop, closed-loop occupant-centric

Abstract:Accurate occupancy information is essential for closed-loop occupant-centric control (OCC) in smart buildings. However, existing vision-based occupancy measurement methods often struggle to provide stable and accurate measurements in real indoor environments, and their implications for downstream HVAC control remain insufficiently studied. To achieve Net Zero emissions by 2050, this paper presents an experimental study of large language models (LLMs)-enhanced vision-based indoor occupancy measurement and its impact on OCC-enabled HVAC operation. Detection-only, tracking-based, and LLM-based refinement pipelines are compared under identical conditions using real surveillance data collected from a research laboratory in China, with frame-level manual ground-truth annotations. Results show that tracking-based methods improve temporal stability over detection-only measurement, while LLM-based refinement further improves occupancy measurement performance and reduces false unoccupied prediction. The best-performing pipeline, YOLOv8+DeepSeek, achieves an accuracy of 0.8824 and an F1-score of 0.9320. This pipeline is then integrated into an HVAC supervisory model predictive control framework in OpenStudio-EnergyPlus. Experimental results demonstrate that the proposed framework can support more efficient OCC operation, achieving a substantial HVAC energy-saving potential of 17.94%. These findings provide an effective methodology and practical foundation for future research in AI-enhanced smart building operations.
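
Accuracy and F1 are both derived from the same frame-level confusion counts. The sketch below uses made-up counts (not the paper's data) and treats "occupied" as the positive class, which is an assumption. It illustrates one way an F1-score can sit above accuracy: when positive frames dominate and true negatives are few.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from binary confusion counts.

    tp/fp/fn/tn are frame counts with "occupied" as the positive class
    (an assumed convention for this illustration).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts: mostly occupied frames, few true negatives.
acc, f1 = accuracy_and_f1(tp=80, fp=10, fn=5, tn=5)
```

With these illustrative counts the accuracy is 0.85 while F1 is 160/175, roughly 0.914, because F1 ignores true negatives and rewards the large number of correctly detected occupied frames.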

82. 【2603.26078】When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

Link: https://arxiv.org/abs/2603.26078

Authors: Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: preserving single identities, achieved remarkable success, compose multiple interacting, remains largely unexplored, multiple interacting subjects

Comments: 10 pages, 7 figures, accepted by CVPR 2026 Workshop P13N

Abstract:Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive "Illusion of Scalability" in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2's structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.

83. 【2603.26071】MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

Link: https://arxiv.org/abs/2603.26071

Authors: Kyungwon Kim, Dosik Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate survival prediction, clinical deployment faces, frequently incomplete due, Accurate survival, technical limitations

Comments: Accepted to CVPR 2026. 10 pages, 5 figures, supplementary included

Abstract:Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.

84. 【2603.26068】PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Link: https://arxiv.org/abs/2603.26068

Authors: Elkhan Ismayilzada, Yufei Zhang, Zijun Cui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Significant advancements made, delivered accurate single-frame, Significant advancements, accurate single-frame estimates, advancements made

Comments: Accepted to CVPR 2026

Abstract:Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

85. 【2603.26067】R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.26067

Authors: Tianrui Lou, Siyuan Liang, Jiawei Liang, Yuze Gao, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: severe security threat, autonomous driving systems, mapping adversarial textures, poses a severe, severe security

Comments: Under review

Abstract:Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration [...]. To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted [...]. To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks. This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.

86. 【2603.26064】MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection

Link: https://arxiv.org/abs/2603.26064

Authors: Peiyuan Jiang, Yao Liu, Yanglei Gan, Jiaye Yang, Lu Liu, Daibing Yao, Qiao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: detection remains challenging, stable cross-subject patterns, auditory deception cues, deception detection remains, automatic deception detection

Abstract:Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.

87. 【2603.26055】Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Link: https://arxiv.org/abs/2603.26055

Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou, Jihong Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurately estimating humans', estimating humans' subjective, humans' subjective feedback, Accurately estimating, motion consistency

Comments: 14 pages, 6 figures. Accepted by CVPR 2026 findings track

Abstract:Accurately estimating humans' subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior arts have focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.

88. 【2603.26052】Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Link: https://arxiv.org/abs/2603.26052

Authors: Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: grounding are crucial, Local Semantic Fusion, Mask-aware Local Semantic, local semantic, Hierarchical Semantic Aggregation

Comments: Accepted by CVPR 2026

Abstract:As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.

89. 【2603.26049】Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

Link: https://arxiv.org/abs/2603.26049

Authors: Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains largely underexplored, medical vision-language pretraining, radiographs are typically, context-agnostic images, remains largely

Comments: Code: [this https URL](https://github.com/mk-runner/CoGaze)

Abstract:Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

90. 【2603.26041】Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives

Link: https://arxiv.org/abs/2603.26041

Authors: Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, demonstrated strong potential, Multimodal Large

Comments:

Abstract:In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
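
The second and third findings in the abstract above (random pruning preserves spatial structure; recent screenshots deserve larger token budgets) can be illustrated with a minimal sketch. The geometric decay schedule, the `decay` parameter, and both function names are assumptions made for illustration; the abstract does not state how the authors allocate budgets.

```python
import random


def recency_budgets(num_screenshots: int, total_budget: int, decay: float = 0.5) -> list[int]:
    """Split a token budget across screenshots ordered oldest to newest.

    Hypothetical schedule: the newest screenshot gets weight 1, and each
    older screenshot gets `decay` times the weight of the next newer one.
    Every screenshot keeps at least one token, so the sum can slightly
    exceed `total_budget` for very long histories.
    """
    weights = [decay ** (num_screenshots - 1 - i) for i in range(num_screenshots)]
    scale = total_budget / sum(weights)
    return [max(1, int(w * scale)) for w in weights]


def random_prune(tokens: list, keep: int, rng: random.Random) -> list:
    """Keep `keep` tokens chosen uniformly at random, in their original order."""
    if keep >= len(tokens):
        return list(tokens)
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


if __name__ == "__main__":
    rng = random.Random(0)
    history = [list(range(100)) for _ in range(4)]  # four screenshots, 100 tokens each
    budgets = recency_budgets(len(history), total_budget=150)
    pruned = [random_prune(toks, b, rng) for toks, b in zip(history, budgets)]
    print(budgets, [len(p) for p in pruned])
```

Sampling uniformly and then sorting the kept indices is what keeps the surviving tokens spread across the whole screenshot rather than clustered in the regions a saliency score would favor.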

91. 【2603.26036】Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Link: https://arxiv.org/abs/2603.26036

Authors: Kutub Uddin, Nusrat Tasnim, Byung Tae Oh

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimedia data, including surveillance, visual interaction, evidence gathering, images and videos

Comments:

Abstract:Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation (HFR) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in intra-dataset, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR. The results demonstrate that our approach generalizes effectively and achieves promising performance, outperforming existing methods.

92. 【2603.26033】Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs

Link: https://arxiv.org/abs/2603.26033

Authors: Jiazheng Xing, Chao Xu, Hangjie Yuan, Mengmeng Wang, Jun Dan, Hangwei Qian, Yong Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, few-shot action recognition

Comments:

Abstract:Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature-caption-feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM's multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.

93. 【2603.26028】Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

Link: https://arxiv.org/abs/2603.26028

Authors: Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical Visual Question, Visual Question Answering, Medical Visual, Question Answering, Visual Question

Comments:

Abstract:Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.
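
To make the two mechanisms in the abstract above more concrete (a momentum-updated bank of global prototypes, and soft suppression of features that correlate strongly with those prototypes), here is a rough sketch. It is not the authors' LCT module: the cosine-similarity measure, the fixed sigmoid gate, and the `momentum`, `threshold`, and `temperature` values are all assumptions, whereas the paper describes a learnable, differentiable trimming module.

```python
import math


def ema_update(prototype: list[float], feature: list[float], momentum: float = 0.9) -> list[float]:
    """Momentum (exponential moving average) update of one bank prototype."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def soft_trim(feature: list[float], bank: list[list[float]],
              threshold: float = 0.5, temperature: float = 0.1) -> tuple[list[float], float]:
    """Scale a feature by a gate in (0, 1).

    The gate approaches 0 when the feature closely matches some prototype
    in the bank (a dataset-level regularity) and approaches 1 when the
    feature is instance-specific. Hypothetical fixed gate, for illustration.
    """
    max_similarity = max(cosine(feature, prototype) for prototype in bank)
    gate = 1.0 / (1.0 + math.exp((max_similarity - threshold) / temperature))
    return [gate * x for x in feature], gate


if __name__ == "__main__":
    bank = [[1.0, 0.0]]
    print(soft_trim([1.0, 0.0], bank)[1])  # close to 0: matches the prototype
    print(soft_trim([0.0, 1.0], bank)[1])  # close to 1: instance-specific
```

In a trainable version the threshold and temperature would typically be parameters optimized together with the rest of the network, which is the part this sketch leaves out.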

94. 【2603.26019】Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features

Link: https://arxiv.org/abs/2603.26019

Authors: Mengdi Liu, Qiang Li, Weizhi Nie, Shaopeng Zhang, Yuting Su

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Type A Aortic, Aortic Dissection, precise preoperative evaluation, demands rapid, rapid and precise

Comments:

Abstract:Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotations, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.

95. 【2603.26018】GeoReFormer: Geometry-Aware Refinement for Lane Segment Detection and Topology Reasoning

Link: https://arxiv.org/abs/2603.26018

Authors: Danny Abraham, Nikhil Kamalkumar Advani, Arun Das, Nikil Dutt

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: online map construction, lane segment detection, autonomous driving, reasoning are critical, construction in autonomous

Comments: 8 pages, 6 figures

Abstract:Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.

96. 【2603.26015】VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Link: https://arxiv.org/abs/2603.26015

Authors: Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Human age estimation, challenging computer vision, large vision-language models, Human age, age estimation

Comments:

Abstract:Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
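
The eight metrics named in the abstract above have standard definitions, sketched below in plain Python. The function name is hypothetical, and two conventions are assumptions rather than details from the paper: the bias error is taken as predicted minus true age (positive means overestimation), and MAPE assumes every true age is greater than zero.

```python
import math


def age_metrics(y_true: list[float], y_pred: list[float], tolerance: float = 5.0) -> dict[str, float]:
    """Compute MAE, MSE, RMSE, MAPE, MBE, R^2, CCC, and +/- tolerance accuracy."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]  # predicted minus true

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100.0 * sum(abs(e) / t for e, t in zip(errors, y_true)) / n  # needs t > 0
    mbe = sum(errors) / n

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": mape,
        "MBE": mbe,
        # Coefficient of determination: 1 - SS_res / SS_tot.
        "R2": 1.0 - mse / var_t,
        # Lin's concordance correlation coefficient.
        "CCC": 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2),
        # Share of predictions within `tolerance` years of the true age.
        "ACC_TOL": sum(abs(e) <= tolerance for e in errors) / n,
    }


if __name__ == "__main__":
    truth = [20.0, 30.0, 40.0, 50.0]
    preds = [22.0, 28.0, 40.0, 60.0]
    for name, value in age_metrics(truth, preds).items():
        print(f"{name}: {value:.3f}")
```

Reporting MBE and CCC alongside MAE is useful here because a model can have a low average error while still systematically over- or underestimating age for some groups.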

97. 【2603.26008】FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

Link: https://arxiv.org/abs/2603.26008

Authors: Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal large language, highlighting fairness risks, display uneven performance, multimodal large, powerful in image-conditioned

Comments: Accepted to CVPR 2026

Abstract:While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at this https URL.

98. 【2603.25994】Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2603.25994

Authors: Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries, Golnoosh Farnadi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: diffusion models seeks, Concept, diffusion models, generative capability, remove undesired concepts

Comments: Accepted by CVPR 2026 main

Abstract:Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.

99. 【2603.25993】FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

Link: https://arxiv.org/abs/2603.25993

Authors: Changyang Li, Xueqing Huang, Shin-Fang Chng, Huangying Zhan, Qingan Yan, Yi Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstruction models provide, segmentation typically relies, strong geometric foundation, Feed-forward Anchored Scene, instance segmentation typically

Comments:

Abstract:While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.

100. 【2603.25985】JRM: Joint Reconstruction Model for Multiple Objects without Alignment

Link: https://arxiv.org/abs/2603.25985

Authors: Qirui Wu, Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Richard Newcombe, Angel X. Chang, Jakob Engel, Henry Howard-Jenkins

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Object-centric reconstruction seeks, Object-centric reconstruction, seeks to recover, Joint Reconstruction Model, Object-centric

Comments:

Abstract:Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.

101. 【2603.25977】Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)

Link: https://arxiv.org/abs/2603.25977

Authors: Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, Diffusion Magnetic Resonance, Resonance Imaging, Magnetic Resonance, Diffusion Magnetic

Comments:

Abstract:Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) limit traditional models. To address these gaps, we introduce a diffusion space rotary positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: this http URL.

102. 【2603.25968】Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

Link: https://arxiv.org/abs/2603.25968

Authors: Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, computer vision, vision have accelerated, accelerated the development, Recent

Comments:

Abstract:Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machines with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework that incorporates human cognitive insights into reinforcement learning (RL) for autonomous driving without interrupting behavioural responses. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: this https URL.

103. 【2603.25963】BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles

Link: https://arxiv.org/abs/2603.25963

Authors: Shounak Sural, Ragunathan Rajkumar

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safe widespread deployment, generated BEV segmentation, generated BEV, BEV segmentation, safe widespread

Comments: 8 pages, 5 figures

Abstract:Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird's Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, being nearly twice as much as the best performing re-localization baseline. Our code and data will be made available at this https URL.
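
The Recall@1m figure quoted in the abstract above can be read as the share of queries localized within one metre of the ground truth. The sketch below computes it under that reading; the abstract does not spell out the exact definition, so the Euclidean 2D position error and the function name are assumptions.

```python
import math


def recall_at_distance(predicted_xy: list[tuple[float, float]],
                       true_xy: list[tuple[float, float]],
                       threshold_m: float = 1.0) -> float:
    """Fraction of queries whose position error is at most `threshold_m` metres."""
    hits = sum(
        math.dist(pred, true) <= threshold_m
        for pred, true in zip(predicted_xy, true_xy)
    )
    return hits / len(true_xy)


if __name__ == "__main__":
    predictions = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.0)]
    ground_truth = [(0.0, 0.5), (0.0, 0.0), (0.0, 0.0)]
    # Position errors are 0.5 m, 5.0 m, and 0.5 m, so two of three queries count as hits.
    print(recall_at_distance(predictions, ground_truth))
```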

104. 【2603.25951】Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis

Link: https://arxiv.org/abs/2603.25951

Authors: Julia Wolleb, Cristiana Baloescu, Alicia Durrer, Hemant D. Tagare, Xenophon Papademetris

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Implicit neural representations, image representation learning, continuous image representation, Implicit neural, representation learning

Comments:

Abstract:Implicit neural representations (INRs) have emerged as a powerful framework for continuous image representation learning. In Functa-based approaches, each image is encoded as a latent modulation vector that conditions a shared INR, enabling strong reconstruction performance. However, the structure and interpretability of the corresponding latent spaces remain largely unexplored. In this work, we investigate the latent space of Functa-based models for ultrasound videos and propose Low-Rank-Modulated Functa (LRM-Functa), a novel architecture that enforces a low-rank adaptation of modulation vectors in the time-resolved latent space. When applied to cardiac ultrasound, the resulting latent space exhibits clearly structured periodic trajectories, facilitating visualization and interpretability of temporal patterns. The latent space can be traversed to sample novel frames, revealing smooth transitions along the cardiac cycle, and enabling direct readout of end-diastolic (ED) and end-systolic (ES) frames without additional model training. We show that LRM-Functa outperforms prior methods in unsupervised ED and ES frame detection, while compressing each video frame to as low as rank k=2 without sacrificing competitive downstream performance on ejection fraction prediction. Evaluations on out-of-distribution frame selection in a cardiac point-of-care dataset, as well as on lung ultrasound for B-line classification, demonstrate the generalizability of our approach. Overall, LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis. The code is available at this https URL.

105. 【2603.25946】Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets

Link: https://arxiv.org/abs/2603.25946

Authors: Alex Koran, Dimitrios Sinodinos, Hadi Hojjati, Takuya Nanri, Fangge Chen, Narges Armanfard

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: High infraction rates, CARLA Leaderboard, infraction rates remain, High infraction, Multiple Instance Learning

Comments: 33 pages, 11 figures

Abstract:High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
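
The abstract above says VLAAD uses a Multiple Instance Learning (MIL) formulation to obtain temporally localized collision signals. A common way to set this up, shown here only as an illustrative sketch, treats each video as a bag of clip-level scores, scores the bag by its highest-scoring clip, and trains with a hinge ranking loss between collision and normal videos. Whether VLAAD uses max pooling or this particular loss is not stated in the abstract; both are assumptions, as are the function names.

```python
def bag_score(clip_scores: list[float]) -> float:
    """Video-level score under max pooling: the most anomalous clip decides."""
    return max(clip_scores)


def localize(clip_scores: list[float]) -> int:
    """Index of the clip most likely to contain the collision."""
    return max(range(len(clip_scores)), key=lambda i: clip_scores[i])


def mil_ranking_loss(collision_clips: list[float], normal_clips: list[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss: a collision video should outscore a normal one by `margin`."""
    return max(0.0, margin - bag_score(collision_clips) + bag_score(normal_clips))


if __name__ == "__main__":
    collision_video = [0.1, 0.9, 0.2]  # clip-level scores, only video-level labels needed
    normal_video = [0.1, 0.3]
    print(localize(collision_video))
    print(mil_ranking_loss(collision_video, normal_video))
```

The appeal of this setup is that it needs only a video-level label (collision or not) while still producing a per-clip score that can be used for early warning.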

106. 【2603.25942】Reinforcing Structured Chain-of-Thought for Video Understanding

Link: https://arxiv.org/abs/2603.25942

Authors: Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang, Haibin Ling, Oleksandr Obiednikov, Ning Zhou, Kah Kuen Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Language Models, Large Language

Comments: Accepted to CVPR 2026 (Main Conference)

Click to view abstract

Abstract:Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize - Think - Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
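
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage that the objective is built on: each sampled response is scored against the mean and standard deviation of its own group. This is a minimal illustration of standard GRPO only. The CVK and DVR terms from the abstract are not modelled, and the use of population standard deviation, the `eps` value, and the example rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
group_accuracy = mean(rewards)  # the kind of group statistic DVR is described as using
```

Correct answers receive positive advantages and wrong ones negative, with no learned value function required.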

107. 【2603.25935】DenseSwinV2: Channel Attentive Dual Branch CNN Transformer Learning for Cassava Leaf Disease Classification

Link: https://arxiv.org/abs/2603.25935

Authors: Shah Saood (1), Saddam Hussain Khan (2) ((1) Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat 19060, Pakistan; (2) Interdisciplinary Research Center for Smart Mobility and Logistics (IRC-SML), King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hierarchical customized Swin, jointly leverages densely, leverages densely connected, customized Swin Transformer, densely connected convolutional

Comments: 30 Pages, 12 Figures, 3 Tables

Click to view abstract

Abstract:This work presents Hybrid Dense SwinV2, a two-branch framework that jointly leverages densely connected convolutional features and hierarchical, customized Swin Transformer V2 (SwinV2) representations for cassava disease classification. The DenseNet branch captures high-resolution local features, preserving fine structural cues while allowing effective gradient flow. Concurrently, the customized SwinV2 branch models global contextual dependencies through shifted-window self-attention, which captures the long-range interactions critical for distinguishing visually similar lesions. Moreover, an attention channel-squeeze module is applied to each of the CNN and Transformer streams independently to emphasize discriminative disease-related responses and suppress redundant or background-driven activations. Finally, these discriminative channels are fused to obtain refined representations from the strengthened dense local and SwinV2 global feature maps. Dense SwinV2 is evaluated on a public cassava leaf disease dataset of 31,000 images covering five classes: brown streak, mosaic, green mottle, bacterial blight, and normal leaves. It achieves a classification accuracy of 98.02% with an F1 score of 97.81%, outperforming well-established convolutional and transformer models. These results indicate that Hybrid Dense SwinV2 is robust and practical for field-level diagnosis of cassava disease under real-world challenges such as occlusion, noise, and complex backgrounds.

108. 【2603.25931】DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Link: https://arxiv.org/abs/2603.25931

Authors: Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: produce temporally coherent, generators produce temporally, routinely violate elementary, violate elementary physics, penalize per-frame deviations

Comments:

Click to view abstract

Abstract:Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.

109. 【2603.25924】Good Scores, Bad Data: A Metric for Multimodal Coherence

Link: https://arxiv.org/abs/2603.25924

Authors: Vasundra Srinivasan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Question Answering, data is coherent, Question Answering, systems are evaluated, underlying data

Comments: 9 pages, 6 figures, NeurIPS 2024 format

Click to view abstract

Abstract:Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
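
As a rough illustration of how a four-dimension coherence metric can both score a sample and say what broke, the sketch below combines per-dimension scores with a normalized weighted sum and reports the weakest dimension. The weighted-sum form is an assumption about MCS. The equal weights are placeholders, not the Nelder-Mead-learned values, and the example scores are invented.

```python
# Hypothetical sketch of a four-dimension coherence score (not the paper's code).

DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def coherence_score(scores, weights):
    """Weighted average of per-dimension scores, each assumed to lie in [0, 1]."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


def weakest_dimension(scores):
    """Name of the lowest-scoring dimension, i.e. what most likely broke."""
    return min(DIMENSIONS, key=lambda d: scores[d])


# Placeholder weights; the paper learns its weights via Nelder-Mead.
weights = {d: 1.0 for d in DIMENSIONS}
sample = {"identity": 0.9, "spatial": 0.4, "semantic": 0.8, "decision": 0.7}

mcs = coherence_score(sample, weights)
broken = weakest_dimension(sample)
```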

110. 【2603.25906】Shared Representation for 3D Pose Estimation, Action Classification, and Progress Prediction from Tactile Signals

Link: https://arxiv.org/abs/2603.25906

Authors: Isaac Han, Seoyoung Lee, Sangyeon Park, Ecehan Akan, Yiyue Luo, Joseph DelPreto, Kyung-Joong Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating human pose, predicting movement progress, Estimating human, human-robot interaction, predicting movement

Comments:

Click to view abstract

Abstract:Estimating human pose, classifying actions, and predicting movement progress are essential for human-robot interaction. While vision-based methods suffer from occlusion and privacy concerns in realistic environments, tactile sensing avoids these issues. However, prior tactile-based approaches handle each task separately, leading to suboptimal performance. In this study, we propose a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three separate prediction tasks: 3D human pose estimation, action class categorization, and action completion progress estimation. To the best of our knowledge, this is the first work to explore action progress prediction using foot tactile signals from custom wireless insole sensors. This unified approach leverages the mutual benefits of multi-task learning, enabling the model to achieve improved performance across all three tasks compared to learning them independently. Experimental results demonstrate that SCOTTI outperforms existing approaches across all three tasks. Additionally, we introduce a novel dataset collected from 15 participants, totaling 7 hours of recordings across eight different activities and exercises.

111. 【2603.25901】Decoding Defensive Coverage Responsibilities in American Football Using Factorized Attention Based Transformer Models

Link: https://arxiv.org/abs/2603.25901

Authors: Kevin Song, Evan Diewald, Ornob Siddiquee, Chris Boomhower, Keegan Abdoo, Mike Band, Amy Lee

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: National Football League, Football League, National Football, represent complex tactical, offense passing concept

Comments: 19 pages, 8 figures, ISACE 2026

Click to view abstract

Abstract:Defensive coverage schemes in the National Football League (NFL) represent complex tactical patterns requiring coordinated assignments among defenders who must react dynamically to the offense's passing concept. This paper presents a factorized attention-based transformer model applied to NFL multi-agent play tracking data to predict individual coverage assignments, receiver-defender matchups, and the targeted defender on every pass play. Unlike previous approaches that focus on post-hoc coverage classification at the team level, our model enables predictive modeling of individual player assignments and matchup dynamics throughout the play. The factorized attention mechanism separates temporal and agent dimensions, allowing independent modeling of player movement patterns and inter-player relationships. Trained on randomly truncated trajectories, the model generates frame-by-frame predictions that capture how defensive responsibilities evolve from pre-snap through pass arrival. Our models achieve approximately 89%+ accuracy for all tasks, with true accuracy potentially higher given annotation ambiguity in the ground truth labels. These outputs also enable novel derivative metrics, including disguise rate and double coverage rate, which enhance storytelling in TV broadcasts and provide actionable insights for team strategy development and player evaluation.

112. 【2603.25892】THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond

Link: https://arxiv.org/abs/2603.25892

Authors: Letian Wang, Andrei Zanfir, Eduard Gabriel Bazavan, Misha Andriluka, Cristian Sminchisescu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: jointly addresses dense, addresses dense tasks, keypoint estimation, dense pose, present THFM

Comments:

Click to view abstract

Abstract:We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on par with or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e., without training on real-world or benchmark-specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model trained on videos with a single human in the scene generalizes to multiple humans and other object classes such as anthropomorphic characters and animals, a capability that hasn't been demonstrated in the past.

113. 【2603.25891】Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods

Link: https://arxiv.org/abs/2603.25891

Authors: Ofer Idan, Vladi Vexler, Gil Lederman, Dima Sivov, Aviad Cohen Zada, Shir Niego Komforti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nearest neighbor search, approximate nearest neighbor, commonly encoding images, Pre-trained vision-language models, excel in multimodal

Comments:

Click to view abstract

Abstract:Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition's ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD, the first to explicitly target image retrieval by text accompanied by reference examples, focusing on challenging compositional and OOD queries. The compositional part is divided into urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging 37 positives, i.e., ground-truth matches, per query, plus a significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods leveraging single-shot or few-shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.

114. 【2603.25889】Polarization-Based Eye Tracking with Personalized Siamese Architectures

Link: https://arxiv.org/abs/2603.25889

Authors: Beyza Kalkanli, Tom Bu, Mahsa Shakeri, Alexander Fix, Dave Stronks, Dmitri Model, Mantas Žurauskas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Head-mounted devices integrated, natural human-computer interaction, Head-mounted devices, human-computer interaction, devices integrated

Comments: Accepted to ETRA 2026 as full paper

Click to view abstract

Abstract:Head-mounted devices integrated with eye tracking promise a solution for natural human-computer interaction. However, they typically require per-user calibration for optimal performance due to inter-person variability. A differential personalization approach using Siamese architectures learns relative gaze displacements and reconstructs absolute gaze from a small set of calibration frames. In this paper, we benchmark Siamese personalization on polarization-enabled eye tracking. For benchmarking, we use a 338-subject dataset captured with a polarization-sensitive camera and 850 nm illumination. We achieve performance comparable to linear calibration with 10-fold fewer samples. Using polarization inputs for Siamese personalization reduces gaze error by up to 12% compared to near-infrared (NIR)-based inputs. Combining Siamese personalization with linear calibration yields further improvements of up to 13% over a linearly calibrated baseline. These results establish Siamese personalization as a practical approach enabling accurate eye tracking.
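
The differential personalization step described above can be sketched as follows: a Siamese model predicts the gaze displacement between the query frame and each calibration frame, each displacement is added to that calibration frame's known gaze, and the resulting estimates are averaged. This is a minimal sketch of that reconstruction only. `predict_displacement` stands in for the trained network, and the toy model and data are hypothetical.

```python
# Sketch of absolute-gaze reconstruction from relative displacements.
# `predict_displacement` is a placeholder for a trained Siamese network.

def reconstruct_gaze(query, calibration, predict_displacement):
    """Average (calibration gaze + predicted displacement) over calibration frames."""
    if not calibration:
        raise ValueError("at least one calibration frame is required")
    estimates = []
    for calib_image, (cx, cy) in calibration:
        dx, dy = predict_displacement(query, calib_image)
        estimates.append((cx + dx, cy + dy))
    n = len(estimates)
    return (sum(x for x, _ in estimates) / n, sum(y for _, y in estimates) / n)


# Toy stand-in: each "image" is simply its true gaze point, so the
# displacement model can be exact. A real model would take eye images.
def toy_displacement(query_image, calib_image):
    return (query_image[0] - calib_image[0], query_image[1] - calib_image[1])


calibration = [((0.0, 0.0), (0.0, 0.0)), ((1.0, 2.0), (1.0, 2.0))]
gaze = reconstruct_gaze((0.5, 1.5), calibration, toy_displacement)
```

Averaging over several calibration frames reduces the effect of any single noisy displacement prediction.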

115. 【2603.25887】World Reasoning Arena

Link: https://arxiv.org/abs/2603.25887

Authors: PAN Team Institute of Foundation Models: Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu, Dequan Yang, Junrong Chen, Arif Ahmad, Cong Zeng, Ganesh Bannur, Xinqi Huang, Zheqi Liu, Yi Gu, Yichi Yang, Guangyi Liu, Zhiting Hu, Zhengzhong Liu, Eric Xing

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Action Simulation Fidelity, agents to understand, intended to serve, serve as internal, internal simulators

Comments:

Click to view abstract

Abstract:World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at this https URL.

116. 【2603.25886】Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis

Link: https://arxiv.org/abs/2603.25886

Authors: Prasiddha Bhandari, Kanchan Poudel, Nishant Luitel, Bishram Acharya, Angelina Ghimire, Tyler Wellman, Kilian Koepsell, Pradeep Raj Regmi, Bishesh Khanal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Blind Sweep Obstetric, automated Artificial Intelligence, Artificial Intelligence, allowing minimally trained, minimally trained operators

Comments:

Click to view abstract

Abstract:Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence (AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.

117. 【2603.25870】Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations

Link: https://arxiv.org/abs/2603.25870

Authors: Suraj Prasad, Pinak Mahapatra

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Creating whiteboard-style educational, reproducible drawing representations, videos demands precise, demands precise coordination, existing method addresses

Comments:

Click to view abstract

Abstract:Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.

118. 【2603.25867】Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception

Link: https://arxiv.org/abs/2603.25867

Authors: Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Omid Mohareri

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hinder vision-based functionalities, robot-assisted surgery relies, surgery relies heavily, severely degrade visual, degrade visual perception

Comments: 8 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts a smoke-free image and the corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.

119. 【2603.25864】GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

Link: https://arxiv.org/abs/2603.25864

Authors: Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Graphical User Interface, Graphical User, User Interface, potential to assist, interacting with complex

Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. GUIDE defines three tasks, (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction, that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction, respectively. However, providing user context significantly improved performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at this https URL.

120. 【2603.25863】Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation

Link: https://arxiv.org/abs/2603.25863

Authors: Jasmine Moreira

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: MediaPipe Hand Landmarker, spatiotemporal matrix representation, Hand Landmarker, Brazilian Sign Language, convolutional neural network

Comments: 6 pages, 10 figures, 1 table

Click to view abstract

Abstract:This paper proposes a method for dynamic hand gesture recognition based on the composition of two models: the MediaPipe Hand Landmarker, responsible for extracting 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a spatiotemporal matrix representation of dimensions 90 by 21 of those keypoints. The method is applied to the recognition of LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, enabling continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed for a more thorough evaluation of generalization.
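
One way to picture the 90 by 21 spatiotemporal matrix is sketched below. It assumes, purely for illustration, that the 90 rows come from 30 frames times 3 coordinates (x, y, z) and the 21 columns are the hand keypoints. The abstract does not state this layout, and the temporal frame triplication step is not modelled here. All names and the synthetic input are hypothetical.

```python
from collections import deque

FRAMES = 30     # assumed window length: 30 frames x 3 coordinates = 90 rows
KEYPOINTS = 21  # hand keypoints per frame, as stated in the abstract


def window_to_matrix(window):
    """Stack the x, y and z rows of each frame into a (3 * frames) x 21 matrix."""
    matrix = []
    for frame in window:  # frame: list of 21 (x, y, z) tuples
        if len(frame) != KEYPOINTS:
            raise ValueError("each frame must contain 21 keypoints")
        for axis in range(3):
            matrix.append([point[axis] for point in frame])
    return matrix


def sliding_matrices(frames, size=FRAMES):
    """Yield one matrix per step once the sliding window is full."""
    window = deque(maxlen=size)
    for frame in frames:
        window.append(frame)
        if len(window) == size:
            yield window_to_matrix(window)


# Synthetic stream of 32 frames; real input would come from a landmark detector.
stream = [[(float(t), float(k), 0.0) for k in range(KEYPOINTS)] for t in range(32)]
matrices = list(sliding_matrices(stream))
```

Each yielded matrix can be fed to an image-style CNN, which is what lets the approach avoid recurrent networks.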

121. 【2603.25841】GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

Link: https://arxiv.org/abs/2603.25841

Authors: Trong Thang Pham, Hien Nguyen, Ngan Le

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Current multimodal large, multimodal large language, effectively utilize eye-gaze, utilize eye-gaze information, Current multimodal

Comments:

Click to view abstract

Abstract:Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (roughly 1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at this https URL.

122. 【2603.25827】Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents

Link: https://arxiv.org/abs/2603.25827

Authors: Laura Fink, Linus Franke, George Kopanas, Marc Stamminger, Peter Hedman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Signed Distance Field, unstructured image collections, dense Signed Distance, Distance Field, Signed Distance

Comments:

Click to view abstract

Abstract:We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at this https URL.

123. 【2603.25823】ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Link: https://arxiv.org/abs/2603.25823

Authors: Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: stunning visual fidelity, modern AIGC models, AIGC models lies, Beneath the stunning, modern AIGC

Comments:

Click to view abstract

Abstract:Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at this https URL.

124. 【2603.25819】Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

Link: https://arxiv.org/abs/2603.25819

Authors: Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: establishing geometric correspondences, Geometric Foundation Models, Recent Geometric Foundation, Cross-View Image Synthesis, rely on establishing

Abstract:Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.

125. 【2603.25803】Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment

Link: https://arxiv.org/abs/2603.25803

Authors: Spiros Baxevanakis, Platon Karageorgis, Ioannis Dravilas, Konrad Szewczyk

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Training Vision Transformers, presents significant challenges, Training Vision, Vision Transformers, hindering their interpretability

Comments: Preprint. Submitted to Transactions on Machine Learning Research (TMLR). 26 pages, 17 figures

Abstract:Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps, hindering their interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need of ViTs to store global information beyond the [CLS] token. They proposed a novel solution involving the addition of empty input tokens, named registers, which successfully eliminate artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some claims do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we untie terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.

126. 【2603.25802】LEMON: a foundation model for nuclear morphology in Computational Pathology

Link: https://arxiv.org/abs/2603.25802

Authors: Loïc Chadoutaud (1, 2, 3), Alice Blondel (1, 2, 3), Hana Feki (1, 2, 3), Jacqueline Fontugne (4, 5), Emmanuel Barillot (1, 2, 3), Thomas Walter (1, 2, 3) ((1) Institut Curie, Paris, France, (2) Mines Paris PSL, Centre for Computational Biology (CBIO), Paris, France, (3) INSERM U1331, Paris, France, (4) Institut Curie, U1353/UMR9029 IRIS, Equipe IMPACT, Paris, France, (5) Department of Pathology, Université Paris-Saclay, UVSQ, Institut Curie, Saint-Cloud, France)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: effective representation learning, precision medicine, relies on effective, research and precision, representation learning

Abstract:Computational pathology relies on effective representation learning to support cancer research and precision medicine. Although self-supervised learning has driven major progress at the patch and whole-slide image levels, representation learning at the single-cell level remains comparatively underexplored, despite its importance for characterizing cell types and cellular phenotypes. We introduce LEMON (Learning Embeddings from Morphology Of Nuclei), a self-supervised foundation model for scalable single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, LEMON learns robust and versatile morphological representations that support large-scale single-cell analyses in pathology. We evaluate LEMON on five benchmark datasets across a range of prediction tasks and show that it provides strong performance, highlighting its potential as a new paradigm for cell-level computational pathology. Model weights are available at this https URL.

127. 【2603.25798】End-to-end Feature Alignment: A Simple CNN with Intrinsic Class Attribution

Link: https://arxiv.org/abs/2603.25798

Authors: Parniyan Farvardin, David Chapman

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Feature-Align CNN, prototype CNN architecture, Feature-Align CNN, prototype CNN, CNN architecture

Abstract:We present Feature-Align CNN (FA-CNN), a prototype CNN architecture with intrinsic class attribution through end-to-end feature alignment. Our intuition is that the use of unordered operations such as Linear and Conv2D layers causes unnecessary shuffling and mixing of semantic concepts, thereby making raw feature maps difficult to understand. We introduce two new order-preserving layers: the dampened skip connection and the global average pooling classifier head. These layers force the model to maintain an end-to-end feature alignment from the raw input pixels all the way to the final class logits. This end-to-end alignment enhances the interpretability of the model by allowing the raw feature maps to intrinsically exhibit class attribution. We prove theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps. Moreover, we prove that these feature maps slowly morph layer by layer over network depth, showing the evolution of features toward penultimate class activations. FA-CNN performs well on benchmark image classification datasets. In addition, we compare the averaged FA-CNN raw feature maps against Grad-CAM and permutation methods in a percent-pixels-removed interpretability task. We conclude with a discussion of future work, including limitations and extensions toward hybrid models.

128. 【2603.25791】ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

Link: https://arxiv.org/abs/2603.25791

Authors: Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generally require pre-scanning, Existing hand-object interactions, monocular RGB video, articulated objects generally, objects generally require

Comments: Accepted to CVPR 2026

Abstract:Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: this https URL.

129. 【2603.25778】Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

Link: https://arxiv.org/abs/2603.25778

Authors: Yuan Zhang, Sihao Dou, Kai Hu, Shuhua Deng, Chunhong Cao, Fen Xiao, Xieping Gao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited high-quality annotations, early gastrointestinal screening, Endoscopic video analysis, high-quality annotations, analysis is essential

Comments: Accepted to CVPR 2026

Abstract:Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at this https URL.

130. 【2603.25765】Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification

Link: https://arxiv.org/abs/2603.25765

Authors: Binwei Chen, Huachao Leng, Chi Yeung Mang, Tsz Wai Cheung, Yanhua Chen, Wai Keung Anthony Loh, Chi Ho Wong, Chak Yin Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)

Keywords: Hard coatings play, demand superior mechanical, Hard coatings, superior mechanical performance, offering outstanding hardness

Abstract:Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.

131. 【2603.25761】A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents

Link: https://arxiv.org/abs/2603.25761

Authors: Fitsum Sileshi Beyene, Christopher L. Dancy

Subjects: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Optical character recognition, systems increasingly rely, document understanding systems, evaluation remains centered, understanding systems increasingly

Comments: This manuscript is the author's submitted version to the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2026). Please cite the final published version via ACM Digital Library when available

Abstract:Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. We review OCR and document understanding papers, as well as benchmark datasets, published between 2006 and 2025, following the PRISMA framework. We examine how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. We find that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts. They rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings into perspective, we use previous empirical studies and archival statistics from significant Black press collections to show how evaluation gaps lead to structural invisibility and representational harm. We propose that these gaps arise from organizational (meso) and institutional (macro) behaviors and structures, shaped by benchmark incentives and data governance decisions.

132. 【2603.25758】A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

Link: https://arxiv.org/abs/2603.25758

Authors: Changyu Liu, James Chenhao Liang, Wenhao Yang, Yiming Cui, Jinghao Yang, Tianyang Wang, Qifan Wang, Dongfang Liu, Cheng Han

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: generative artificial intelligence, Diffusion models, discriminative representation learning, significantly reshaped, reshaped the field

Abstract:Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, demonstrating a promising avenue for downstream discriminative tasks via generative pre-training. However, its current training efficiency and representational capacity remain largely constrained due to the inadequate timestep searching and insufficient exploitation of DiT-specific feature representations. In light of this view, we introduce Automatically Selected Timestep (A-SelecT) that dynamically pinpoints DiT's most information-rich timestep from the selected transformer feature in a single run, eliminating the need for both computationally intensive exhaustive timestep searching and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, surpasses all prior diffusion-based attempts efficiently and effectively.

133. 【2603.26393】Adapting Frozen Mono-modal Backbones for Multi-modal Registration via Contrast-Agnostic Instance Optimization

Link: https://arxiv.org/abs/2603.26393

Authors: Yi Zhang, Yidong Zhao, Qian Tao

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: intensity distributions vary, distributions vary significantly, medical image analysis, Deformable image registration, Deformable image

Comments: MICCAI Learn2Reg Challenge

Abstract:Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Moreover, naive fine-tuning can degrade performance under drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained mono-modal registration model with a lightweight adaptation pipeline for multi-modal image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model and thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and fourth overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.
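
For readers unfamiliar with the metric behind the ranking quoted above, the Dice score measures the overlap between a predicted and a reference segmentation. The snippet below is a minimal sketch for flat binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; this is not the challenge's evaluation code, which may average over multiple anatomical labels.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences.

    Illustrative helper (hypothetical name), not the Learn2Reg evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground size of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

A value of 1.0 means the masks coincide and 0.0 means they share no foreground element.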

134. 【2603.26117】FINDER: Zero-Shot Field-Integrated Network for Distortion-free EPI Reconstruction in Diffusion MRI

Link: https://arxiv.org/abs/2603.26117

Authors: Namgyu Han, Seong Dae Yun, Chaeeun Lim, Sunghyun Seok, Sunju Kim, Yoonhwan Kim, Yohan Jun, Tae Hyung Kim, Berkin Bilgic, Jaejin Cho

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapid sampling scheme, sequence highly sensitive, Echo-planar imaging, Distortion-free EPI Reconstruction, severe geometric distortions

Comments: 11 pages, 4 figures

Abstract:Echo-planar imaging (EPI) remains the cornerstone of diffusion MRI, but it is prone to severe geometric distortions due to its rapid sampling scheme that renders the sequence highly sensitive to $B_{0}$ field inhomogeneities. While deep learning has helped improve MRI reconstruction, integrating robust geometric distortion correction into a self-supervised framework remains an unmet need. To address this, we present FINDER (Field-Integrated Network for Distortion-free EPI Reconstruction), a novel zero-shot, scan-specific framework that reformulates reconstruction as a joint optimization of the underlying image and the $B_{0}$ field map. Specifically, we employ a physics-guided unrolled network that integrates dual-domain denoisers and virtual coil extensions to enforce robust data consistency. This is coupled with an Implicit Neural Representation (INR) conditioned on spatial coordinates and latent image features to model the off-resonance field as a continuous, differentiable function. Employing an alternating minimization strategy, FINDER synergistically updates the reconstruction network and the field map, effectively disentangling susceptibility-induced geometric distortions from anatomical structures. Experimental results demonstrate that FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, offering a robust solution for high-quality diffusion imaging.

135. 【2603.26014】Cone-Beam CT Image Quality Enhancement Using A Latent Diffusion Model Trained with Simulated CBCT Artifacts

Link: https://arxiv.org/abs/2603.26014

Authors: Naruki Murahashi, Mitsuhiro Nakamura, Megumi Nakao

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cone-beam computed tomography, Cone-beam computed, high artifact content, image quality, CBCT image quality

Abstract:Cone-beam computed tomography (CBCT) images are problematic in clinical medicine because of their low contrast and high artifact content compared with conventional CT images. Although several studies have sought to improve image quality, in regions subject to organ deformation the anatomical structure may change after such improvement. In this study, we propose an overcorrection-free CBCT image quality enhancement method based on a conditional latent diffusion model using pseudo-CBCT images. Pseudo-CBCT images are created from CT images using a simple method that simulates CBCT artifacts and are spatially consistent with the CT images. By performing self-supervised learning with these spatially consistent paired images, we can improve image quality while maintaining anatomical structures. Furthermore, extending the framework of the conditional diffusion model to latent space improves the efficiency of image processing. Our model was trained on pelvic CT-pseudo-CBCT paired data and was applied to both pseudo-CBCT and real CBCT data. The experimental results using data from 75 cases show that with our proposed method, the structural changes were less than 1/1000th (in terms of the number of pixels) of those of a conventional method involving learning with real images, and the correlation coefficient between the CT value distributions of the generated and reference images was 0.916, approaching the level of conventional methods. We also confirmed that the proposed framework achieves faster processing and superior improvement performance compared with the framework of a conditional diffusion model, even under constrained training settings.

136. 【2603.26007】Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort

Link: https://arxiv.org/abs/2603.26007

Authors: Ishaan Cherukuri

Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mild cognitive impairment, Alzheimer disease, progress to Alzheimer, cognitive impairment, stages of neurodegeneration

Abstract:Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer's disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study shows that measuring how BSC changes over time, namely how fast the boundary degrades each year, works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features capturing boundary degradation rates, feeding them into Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800 to $1,500 vs. $5,000 to $7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.
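
As background for the C-index figures quoted above, the following minimal sketch computes Harrell's concordance index for right-censored data and reproduces the relative-improvement arithmetic. The function name `concordance_index` and the toy follow-up data are assumptions for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up times
    events : 1 if the event (e.g. conversion) was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    Illustrative helper (hypothetical name), not the study's implementation.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.5, 0.1]
    print(concordance_index(times, events, risks))

    # Relative improvement implied by the two C-index values in the abstract.
    improvement = (0.63 - 0.24) / 0.24 * 100
    print(improvement)  # about 162.5, reported as 163% in the abstract
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times.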

137. 【2603.25945】Adapting Segment Anything Model 3 for Concept-Driven Lesion Segmentation in Medical Images: An Experimental Study

Link: https://arxiv.org/abs/2603.25945

Authors: Guoping Xu, Jayaram K. Udupa, Yubing Tong, Xin Long, Ying Zhang, Jie Deng, Weiguo Lu, You Zhang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: specific anatomical sites, medical image analysis, limiting their generalizability, medical image, image analysis

Comments: 31 pages, 8 figures

Abstract:Accurate lesion segmentation is essential in medical image analysis, yet most existing methods are designed for specific anatomical sites or imaging modalities, limiting their generalizability. Recent vision-language foundation models enable concept-driven segmentation in natural images, offering a promising direction for more flexible medical image analysis. However, concept-prompt-based lesion segmentation, particularly with the latest Segment Anything Model 3 (SAM3), remains underexplored. In this work, we present a systematic evaluation of SAM3 for lesion segmentation. We assess its performance using geometric bounding boxes and concept-based text and image prompts across multiple modalities, including multiparametric MRI, CT, ultrasound, dermoscopy, and endoscopy. To improve robustness, we incorporate additional prior knowledge, such as adjacent-slice predictions, multiparametric information, and prior annotations. We further compare different fine-tuning strategies, including partial module tuning, adapter-based methods, and full-model optimization. Experiments on 13 datasets covering 11 lesion types demonstrate that SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation. These results highlight the potential of concept-based foundation models for scalable and practical medical image segmentation. Code and trained models will be released at: this https URL

138. 【2603.25869】Learning to Recorrupt: Noise Distribution Agnostic Self-Supervised Image Denoising

Link: https://arxiv.org/abs/2603.25869

Authors: Brayan Monroy, Jorge Bacca, Julián Tachella

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: trivial identity mapping, specialized loss functions, Self-supervised image denoising, Self-supervised image, identity mapping

Abstract:Self-supervised image denoising methods have traditionally relied on either architectural constraints or specialized loss functions that require prior knowledge of the noise distribution to avoid the trivial identity mapping. Among these, approaches such as Noisier2Noise or Recorrupted2Recorrupted create training pairs by adding synthetic noise to the noisy images. While effective, these recorruption-based approaches require precise knowledge of the noise distribution, which is often not available. We present Learning to Recorrupt (L2R), a noise distribution-agnostic denoising technique that eliminates the need for knowledge of the noise distribution. Our method introduces a learnable monotonic neural network that learns the recorruption process through a min-max saddle-point objective. The proposed method achieves state-of-the-art performance across unconventional and heavy-tailed noise distributions, such as log-gamma, Laplace, and spatially correlated noise, as well as signal-dependent noise models such as Poisson-Gaussian noise.
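
To illustrate what "creating training pairs by adding synthetic noise" means for the recorruption-based baselines mentioned above, here is a minimal sketch of the Gaussian case in the style of Recorrupted2Recorrupted, assuming i.i.d. noise with a known standard deviation `sigma`. The function name `recorrupt_pair` and the default `alpha` are assumptions for illustration. This is the classical fixed recorruption, not the learned recorruption that L2R proposes.

```python
import random


def recorrupt_pair(noisy, sigma, alpha=0.5, rng=random):
    """Build a recorrupted (input, target) pair from one noisy signal.

    Assumes additive Gaussian noise with known standard deviation `sigma`.
    For each sample y, draw z ~ N(0, sigma^2) and return
        input  = y + alpha * z
        target = y - z / alpha
    so that the noise in input and target is uncorrelated.
    Illustrative helper (hypothetical name), not the L2R method.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    pair_in, pair_out = [], []
    for y in noisy:
        z = rng.gauss(0.0, sigma)
        pair_in.append(y + alpha * z)
        pair_out.append(y - z / alpha)
    return pair_in, pair_out


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = 1.0
    # Clean signal is all zeros here, so the noisy signal is pure noise.
    noisy = [rng.gauss(0.0, sigma) for _ in range(20000)]
    a, b = recorrupt_pair(noisy, sigma, alpha=0.5, rng=rng)
    # Sample covariance between the two noise realizations (close to zero).
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    print(cov)
```

The need to know `sigma` in advance is exactly the requirement the abstract says L2R removes.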