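This post presents the daily list of the latest papers fetched from the arXiv website, grouped into categories such as natural language processing, information retrieval, and computer vision.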

Statistics

548 papers were updated today, including:

Natural language processing: 70 papers
Information retrieval: 11 papers
Computer vision: 175 papers

Natural Language Processing

1. 【2603.06552】KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

Link: https://arxiv.org/abs/2603.06552

Authors: Archie Sage, Salvatore Greco

Categories: Computation and Language (cs.CL)

Keywords: KCLarity team participation, political discourse, paper describes, describes the KCLarity, KCLarity team

Comments: Under review at SemEval 2026

Abstract:This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
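
The evasion-first formulation in the abstract above (predict the evasion label, then derive clarity through the taxonomy hierarchy) can be sketched as a plain lookup. This is a minimal illustration, not the team's code: the mapping `EVASION_TO_CLARITY`, the function `derive_clarity`, and every label name below are assumptions made for the example and are not copied from the task definition.

```python
# Hypothetical evasion-to-clarity hierarchy. All label names are
# illustrative assumptions, not the official task taxonomy.
EVASION_TO_CLARITY = {
    "explicit": "clear_reply",
    "implicit": "ambivalent",
    "dodging": "ambivalent",
    "deflection": "ambivalent",
    "declining_to_answer": "clear_non_reply",
    "claims_ignorance": "clear_non_reply",
}


def derive_clarity(evasion_label: str) -> str:
    """Map a predicted fine-grained evasion label to its parent clarity label."""
    try:
        return EVASION_TO_CLARITY[evasion_label]
    except KeyError:
        raise ValueError(f"unknown evasion label: {evasion_label!r}") from None


if __name__ == "__main__":
    print(derive_clarity("dodging"))
```

With this design a single evasion classifier yields both labels, so the clarity prediction can never contradict the hierarchy.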

2. 【2603.06505】Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Link: https://arxiv.org/abs/2603.06505

Authors: Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar

Categories: Computation and Language (cs.CL)

Keywords: systems remain constrained, Automatic speech recognition, Automatic speech, isolated utterances, settings and short

Comments: Accepted at LREC 2026

Abstract:Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
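
The abstract above mentions a contrastive objective that aligns speech and context representations in a shared space but does not spell out the loss. A minimal sketch, assuming an InfoNCE-style loss over paired embeddings; the function names, the cosine similarity, and the `temperature` value are assumptions for illustration, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(speech, context, temperature=0.1):
    """Mean InfoNCE loss; speech[i] and context[i] form the positive pair."""
    losses = []
    for i, s in enumerate(speech):
        logits = [cosine(s, c) / temperature for c in context]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    speech = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(speech, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
    print(info_nce(speech, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched pairs
```

The loss is small when each speech embedding is closest to its own context embedding and grows when the pairs are mismatched.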

3. 【2603.06503】Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

Link: https://arxiv.org/abs/2603.06503

Authors: Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul

Categories: Computation and Language (cs.CL)

Keywords: enable Large Language, Large Language Models, Large Language, multimodal Retrieval-Augmented Generation, enable Large

Abstract:Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.

4. 【2603.06495】COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Link: https://arxiv.org/abs/2603.06495

Authors: Kartik Sharma, Rakshit S. Trivedi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: signals require hundreds, sample-efficient methods suboptimally, methods enable inference-time, methods suboptimally capture, capture steering signals

Comments: ICLR 2026. Code available at [this https URL](https://github.com/Ksartik/cold-steer)

Abstract:Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.

5. 【2603.06492】NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

Link: https://arxiv.org/abs/2603.06492

Authors: Ethan Smith (Canva Research)

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: nonlinear low-rank branches, adds nonlinear low-rank, Nonlinear lOw-rank Branch, Nonlinear lOw-rank, transformer linear layers

Comments: 14 pages, 5 figures, 5 tables

Abstract:We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch. The branch is a permanent part of the architecture as opposed to an adapter for finetuning on top of frozen weights. The branch computes $\sigma(xW_{\text{down}})W_{\text{up}}$, where $\sigma$ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase with a linear projection in between them in the bottleneck space, performs best. NOBLE achieves substantial improvements with minimal overhead: up to 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as low as 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: Mixup/CutMix augmentation interferes with NOBLE's benefits in ImageNet classification along with other stochastic augmentations, but when disabled, ViT also improves. This discrepancy is possibly explained by regularization techniques that encourage smoother fits to the target function while NOBLE may specialize more in sharper aspects of the target function.
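
The branch formula in the abstract above can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the author's implementation: it assumes the branch output is added to the main linear path, and it replaces CosNet with a single elementwise cosine with per-unit frequency and phase (the abstract describes a two-layer variant with a projection in between). The names `noble_linear`, `cos_nonlinearity`, and `matmul` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cos_nonlinearity(h, freq, phase):
    """Elementwise cos(freq * h + phase); freq and phase would be learnable."""
    return [[math.cos(f * v + p) for v, f, p in zip(row, freq, phase)] for row in h]


def noble_linear(x, w_main, w_down, w_up, freq, phase):
    """Main linear path plus a nonlinear low-rank branch sigma(x W_down) W_up.

    Summing the two paths is an assumption made for this sketch.
    """
    main = matmul(x, w_main)
    branch = matmul(cos_nonlinearity(matmul(x, w_down), freq, phase), w_up)
    return [[m + b for m, b in zip(mr, br)] for mr, br in zip(main, branch)]


if __name__ == "__main__":
    x = [[1.0, 2.0]]                      # one token, model width 2
    w_main = [[1.0, 0.0], [0.0, 1.0]]     # 2 x 2 main weight
    w_down = [[0.0], [0.0]]               # 2 x 1 down-projection (rank 1)
    w_up = [[0.5, -0.5]]                  # 1 x 2 up-projection
    print(noble_linear(x, w_main, w_down, w_up, freq=[1.0], phase=[0.0]))
```

Because the bottleneck width is much smaller than the layer width, the branch adds few parameters relative to the main weight matrix.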

6. 【2603.06485】PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

Link: https://arxiv.org/abs/2603.06485

Authors: Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Explainable Artificial Intelligence, machine learning systems, Explainable Artificial, Artificial Intelligence, neglects user differences

Comments: 15 pages, 2 figures

Abstract:Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.

7. 【2603.06428】Abductive Reasoning with Syllogistic Forms in Large Language Models

Link: https://arxiv.org/abs/2603.06428

Authors: Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large-Language Models, rapidly evolving, key concern, Research, Models

Comments: Published in Proceedings of the 3rd International Conference on Human and Artificial Rationalities (HAR 2024), LNCS 15504, pp. 3-17

Abstract:Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.

8. 【2603.06424】From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Link: https://arxiv.org/abs/2603.06424

Authors: Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh, Huy Tien Nguyen, Tung Le

Categories: Computation and Language (cs.CL)

Keywords: Automated Essay Scoring, Large language models, reshaped Automated Essay, recently reshaped Automated, Essay Scoring

Comments: 19 pages, 10 figures, 7 tables

Abstract:Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, promising potential in auto-grading writing tasks. Code is public at this https URL

9. 【2603.06416】Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

Link: https://arxiv.org/abs/2603.06416

Authors: Hirohiko Abe, Kentaro Ozeki, Risako Ando, Takanobu Morishita, Koji Mineshima, Mitsuhiro Okada

Categories: Computation and Language (cs.CL)

Keywords: large language models, gaining increasing attention, language models, advance in linguistic, linguistic competence

Comments: To appear in the Proceedings of EACL 2026

Abstract:As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs' conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.

10. 【2603.06348】Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

Link: https://arxiv.org/abs/2603.06348

Authors: Tanjim Taharat Aurpa

Categories: Computation and Language (cs.CL)

Keywords: challenging task due, Entity Relation Extraction, Mathematical text understanding, Mathematical Entity Relation, Bidirectional Encoder Representations

Abstract:Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.

11. 【2603.06333】SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

Link: https://arxiv.org/abs/2603.06333

Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: iterative self-modification risks, self-modification risks subtle, Goal Drift Index, risks subtle alignment, subtle alignment drift

Comments: Published at ICLR 2026 Workshop on AI with Recursive Self-Improvement. 20 pages, 5 figures

Abstract:Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.

12. 【2603.06324】The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

Link: https://arxiv.org/abs/2603.06324

Authors: Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: study explores artificial, images intentionally pastiching, artificial visual creativity, explores artificial visual, intentionally pastiching original

Abstract:This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such as paintings, drawings, sculptures and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color- and texture-based similarity on the one hand and compositional, conceptual, and perceptual similarity on the other. Consequently, we advocate for the use of a "style transfer dashboard" of complementary metrics to evaluate the similarity between pastiches and originals, rather than using a single style metric. The artists' comments revealed limitations of ChatGPT's pastiches after contemporary artworks, which were perceived by the authors of the originals as lacking dimensionality, context, and intentional sense, and as seeming more of a paraphrase or an approximate quotation than a valuable, emotion-evoking artwork.

13. 【2603.06290】The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI

Link: https://arxiv.org/abs/2603.06290

Authors: Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco, Gabriele Conte, Vito Walter Anelli, Tommaso Di Noia, Francesco Maria Donini

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Personal Artificial Intelligence, Artificial Intelligence, Personal Artificial, isolated silos, Personal Knowledge Graph

Abstract:Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user's digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.

14. 【2603.06264】Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

Link: https://arxiv.org/abs/2603.06264

Authors: Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru, Abhijnan Chakraborty

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: predominantly English-centric training, English-centric training data, training data risks, data risks misalignment, Large Language Models

Comments: 11 pages, including references

Abstract:Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM's internal representations, using log-probs/logits, to compare the model's opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.

15. 【2603.06222】SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

Link: https://arxiv.org/abs/2603.06222

Authors: Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang, Ruijie Wang

Categories: Computation and Language (cs.CL)

Keywords: verbose token-level traces, incurs high inference, high inference cost, inference cost due, large language models

Abstract:Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
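
The abstract above names a Sinkhorn optimal-transport objective for softly matching pause tokens to reasoning segments. The generic Sinkhorn iteration behind such objectives can be sketched as below. This is the textbook entropic-OT routine under assumed uniform marginals, not SPOT's training loss; the cost matrix, `epsilon`, and the iteration count are illustrative assumptions.

```python
import math


def sinkhorn(cost, epsilon=0.5, iterations=200):
    """Entropic optimal transport with uniform marginals.

    cost[i][j] is the (hypothetical) cost of matching pause token i to
    segment token j. Returns a soft transport plan of the same shape.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    a, b = 1.0 / n, 1.0 / m  # uniform row and column masses
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Two pause tokens, three segment tokens; low cost means a good match.
    plan = sinkhorn([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]])
    for row in plan:
        print([round(p, 4) for p in row])
```

Each row of the plan spreads one pause token's mass over several segment tokens, which is what makes the matching soft rather than point-to-point.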

16. 【2603.06199】FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Link: https://arxiv.org/abs/2603.06199

Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, Large Language, compute-intensive prefilling phase, Long-context modeling

Abstract:Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.

17. 【2603.06198】LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2603.06198

Authors: Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Retrieval-Augmented Generation, Abstention RAG Generator, RAG Generator Benchmark

Comments: Published as a conference paper at LREC 2026

Abstract:Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at this https URL.
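
The abstract above reports category-wise and overall accuracy from judge verdicts. A minimal sketch of that aggregation, assuming each verdict arrives as a `(category, is_correct)` pair; the function name `accuracy_report` and the input format are hypothetical and not taken from the released evaluation code.

```python
from collections import defaultdict


def accuracy_report(judgements):
    """Aggregate (category, is_correct) pairs into per-category and overall accuracy."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in judgements:
        totals[category] += 1
        correct[category] += int(is_correct)
    report = {c: correct[c] / totals[c] for c in totals}
    report["overall"] = sum(correct.values()) / sum(totals.values())
    return report


if __name__ == "__main__":
    # Made-up verdicts for illustration only.
    verdicts = [("Table", True), ("Table", False), ("Logic", True)]
    print(accuracy_report(verdicts))
```

Overall accuracy here is computed over all questions, so categories with more questions weigh more than in a plain average of the per-category scores.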

18. 【2603.06197】Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol Validation Using Eleven Large Language Models

Link: https://arxiv.org/abs/2603.06197

Authors: Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz

Categories: Computation and Language (cs.CL)

Keywords: Large-scale content analysis, extensive human coding, massive datasets due, Large-scale content, observable ground truth

Abstract:Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
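
The aggregation step in the abstract above (majority voting plus agreement diagnostics) can be sketched as follows. This is an illustration under assumptions, not the published protocol: the input layout, the `min_agreement` threshold, and the use of a simple vote share as the diagnostic are all hypothetical.

```python
from collections import Counter


def ai_crowd_consensus(labels_by_model, min_agreement=0.7):
    """Majority-vote labels across models and flag low-agreement items.

    labels_by_model maps a model name to its list of labels, one per item.
    Ties are broken by first-seen order, an arbitrary choice for this sketch.
    """
    n_models = len(labels_by_model)
    n_items = len(next(iter(labels_by_model.values())))
    results = []
    for i in range(n_items):
        votes = Counter(labels[i] for labels in labels_by_model.values())
        label, count = votes.most_common(1)[0]
        agreement = count / n_models
        results.append(
            {"label": label, "agreement": agreement, "flagged": agreement < min_agreement}
        )
    return results


if __name__ == "__main__":
    # Made-up model outputs for illustration only.
    outputs = {
        "model_a": ["pos", "pos"],
        "model_b": ["pos", "neg"],
        "model_c": ["pos", "neg"],
    }
    for item in ai_crowd_consensus(outputs):
        print(item)
```

Flagged items are the ones a human coder would review first, since model disagreement may signal ambiguity or model-specific bias.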

19. 【2603.06194】MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

Link: https://arxiv.org/abs/2603.06194

Authors: Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, Xiaofan Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: require conversational policies, evolving user states, long-horizon interaction quality, optimize long-horizon interaction, emotional support

Abstract:Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
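As a rough illustration of the two ingredients named above, the following sketch computes Monte Carlo returns from per-turn judge rewards and blends a turn-level z-score with a batch-level z-score. The abstract does not specify how the two normalizations are combined, so the convex mix and the weight `alpha` are assumptions, as are the function names.

```python
from statistics import mean, pstdev

def mc_returns(rewards, gamma=1.0):
    """Discounted Monte Carlo return for every turn of one dialogue."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def _zscore(value, population, eps=1e-8):
    # eps avoids division by zero when all values in the population are equal.
    return (value - mean(population)) / (pstdev(population) + eps)

def mixed_advantages(batch_rewards, alpha=0.5, gamma=1.0):
    """Blend turn-level and batch-level normalized returns.

    batch_rewards: one list of per-turn rewards per sampled dialogue.
    alpha: hypothetical mixing weight (not a value from the paper).
    """
    batch_returns = [mc_returns(r, gamma) for r in batch_rewards]
    all_returns = [g for traj in batch_returns for g in traj]
    advantages = []
    for traj in batch_returns:
        row = []
        for t, g in enumerate(traj):
            # Turn-level statistics: returns at the same turn index across dialogues.
            same_turn = [other[t] for other in batch_returns if len(other) > t]
            turn_adv = _zscore(g, same_turn)
            # Batch-level statistics: every return in the batch.
            batch_adv = _zscore(g, all_returns)
            row.append(alpha * turn_adv + (1 - alpha) * batch_adv)
        advantages.append(row)
    return advantages

print(mixed_advantages([[1.0, 0.0], [0.0, 1.0]]))
```

No critic network appears anywhere in the sketch: the baseline comes entirely from sample statistics, which is one way to read "critic-free".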

20. 【2603.06183】CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Link: https://arxiv.org/abs/2603.06183

Authors: Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: chest X-ray report, chest X-ray, X-ray report generation, contextual relevance, X-ray report

Abstract:We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at this https URL.

21. 【2603.06180】Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

Link: https://arxiv.org/abs/2603.06180

Authors: Claire Roman, Philippe Meyer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Learning similarity metrics, scripts remain uncertain, fundamental challenge, uncertain and contested, similarity metrics

Abstract:Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: while individual graphemes within invented alphabets can be reliably labeled, the historical relationships between different scripts remain uncertain and contested. We propose a two-stage framework that addresses this epistemological constraint. First, we train an encoder with contrastive loss on labeled invented alphabets, establishing a teacher model with robust discriminative features. Second, we extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher's knowledge but free to discover latent cross-script similarities. The asymmetric setup enables the student to learn deformation-invariant embeddings while inheriting discriminative structure from clean examples. Our approach bridges supervised contrastive learning and unsupervised discovery, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences. Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships.

22. 【2603.06164】Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Link: https://arxiv.org/abs/2603.06164

Authors: Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: prior work centers, Self-supervised learning, underpins modern audio, Representation Aware Pairwise-gated, Aware Pairwise-gated Transformer

Notes: Submitted to Interspeech 2026, 4 pages, 2 figures

Abstract:Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.

23. 【2603.06135】A Causal Graph Approach to Oppositional Narrative Analysis

Link: https://arxiv.org/abs/2603.06135

Authors: Diego Revilla, Martin Fernandez-de-Retana, Lingfeng Chen, Aritz Bilbao-Jayo, Miguel Fernandez-de-Retana

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: embedding human bias, Current methods, textual analysis rely, predefined ontologies, black-box models

Abstract:Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models. Despite achieving near-perfect performance, these approaches exploit unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. In this work, we propose a graph-based framework for the detection, analysis, and classification of oppositional narratives and their underlying entities by representing narratives as entity-interaction graphs. Moreover, by incorporating causal estimation at the node level, our approach derives a causal representation of each contribution to the final classification by distilling the constructed sentence graph into a minimal causal subgraph. Building upon this representation, we introduce a classification pipeline that outperforms existing approaches to the oppositional thinking classification task.

24. 【2603.06123】Diffusion Language Models Are Natively Length-Aware

Link: https://arxiv.org/abs/2603.06123

Authors: Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger, Dirk Hovy

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: autoregressive language models, Unlike autoregressive language, Diffusion Language Models, language models, Unlike autoregressive

Abstract:Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.

25. 【2603.06114】Making Implicit Premises Explicit in Logical Understanding of Enthymemes

Link: https://arxiv.org/abs/2603.06114

Authors: Xuyao Feng, Anthony Hunter

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Real-world arguments, Real-world, Natural language processing, Natural language, text and dialogues

Abstract:Real-world arguments in text and dialogues are normally enthymemes (i.e. some of their premises and/or claims are implicit). Natural language processing (NLP) methods for handling enthymemes can potentially identify enthymemes in text but they do not decode their underlying logic, whereas logic-based approaches for handling them assume a knowledgebase with sufficient formulae that can be used to decode them via abduction. There is therefore a lack of a systematic method for translating textual components of an enthymeme into a logical argument and generating the logical formulae required for their decoding, and thereby showing logical entailment. To address this, we propose a pipeline that integrates: (1) a large language model (LLM) to generate intermediate implicit premises based on the explicit premise and claim; (2) another LLM to translate the natural language into logical formulas; and (3) a neuro-symbolic reasoner based on a SAT solver to determine entailment. We evaluate our pipeline on two enthymeme datasets, demonstrating promising performance in selecting the correct implicit premise, as measured by precision, recall, F1-score, and accuracy.
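The final entailment step of such a pipeline can be sketched as follows. Premises entail a claim exactly when the premises together with the negated claim are unsatisfiable. In this sketch, brute-force enumeration of truth assignments stands in for a real SAT solver, and the rain example with its variable names is hypothetical.

```python
from itertools import product

def entails(premises, claim, variables):
    """True if every assignment satisfying all premises also satisfies claim.

    Equivalent to checking that (premises AND NOT claim) is unsatisfiable.
    Brute-force enumeration stands in for a SAT solver here, so this only
    scales to a handful of variables.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not claim(env):
            return False  # found a counter-model
    return True

# Hypothetical enthymeme: "It is raining, so the ground is wet."
variables = ["rain", "wet"]
explicit = lambda e: e["rain"]                    # explicit premise: rain
implicit = lambda e: (not e["rain"]) or e["wet"]  # implicit premise: rain -> wet
claim = lambda e: e["wet"]                        # claim: wet

print(entails([explicit], claim, variables))            # False
print(entails([explicit, implicit], claim, variables))  # True
```

The explicit premise alone does not entail the claim. Adding the generated implicit premise closes the gap, which is the property such a reasoner would check.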

26. 【2603.06090】DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Link: https://arxiv.org/abs/2603.06090

Authors: Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal large language, large language models, accurately interpret depth, achieved impressive performance, depth

Abstract:Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images: single-channel grayscale images where the pixel values directly reflect depth cues to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.

27. 【2603.06088】Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

Link: https://arxiv.org/abs/2603.06088

Authors: Xi Wang, Mengdie Zhuang, Jiqun Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, largely prioritized uniform, favour specific behavioural, specific behavioural tendencies

Abstract:Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists," while identifying a "Suppression Advantage" where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency, and lexical diversity, providing a roadmap for "Personality Engineering".

28. 【2603.06066】Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

Link: https://arxiv.org/abs/2603.06066

Authors: Jonas Kubesch, Lena Huber, Clemens Havas

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automated Essay Scoring, mitigating subjective biases, reducing grading workload, Large Language Models, Automated Essay

Notes: To be presented at the SAC2026 and published in its symposium proceedings

Abstract:Automated Essay Scoring (AES) has been explored for decades with the goal to support teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b and LLama3.3 70b, were evaluated with different contexts and prompting strategies. The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.

29. 【2603.06024】ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Link: https://arxiv.org/abs/2603.06024

Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: current vision-language models, Multi-view spatial reasoning, reasoning remains difficult, Multi-view spatial, remains difficult

Abstract:Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.

30. 【2603.06007】MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Link: https://arxiv.org/abs/2603.06007

Authors: Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Large language model-based, extend agentic problem, agentic problem solving, Large language, multi-agent systems

Notes: Submitted to ACL 2026 Demo Track. 10 pages, 6 figures. Code and documentation are available at: [this https URL](https://github.com/BUPT-GAMMA/MASFactory)

Abstract:Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (this https URL) and video (this https URL) are publicly available.

31. 【2603.05996】Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL

Link: https://arxiv.org/abs/2603.05996

Authors: Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai, Zhifeng Hao

Categories: Computation and Language (cs.CL)

Keywords: shown significant potential, potential in single-turn, Generative language models, Generative language, shown significant

Notes: Accepted at the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025), Long Paper, 19 pages

Abstract:Generative language models have shown significant potential in single-turn Text-to-SQL. However, their performance does not extend equivalently to multi-turn Text-to-SQL. This is primarily due to generative language models' inadequacy in handling the complexities of context information and dynamic schema linking in multi-turn interactions. In this paper, we propose a framework named Track-SQL, which enhances generative language models with dual-extractive modules designed to track schema and contextual changes in multi-turn Text-to-SQL. Specifically, Track-SQL incorporates a Semantic-enhanced Schema Extractor and a Schema-aware Context Extractor. Experimental results demonstrate that Track-SQL achieves state-of-the-art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track-SQL significantly improves execution accuracy in multi-turn interactions by 7.1% and 9.55% on these datasets, respectively. Our implementation will be open-sourced at this https URL.

32. 【2603.05969】Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Link: https://arxiv.org/abs/2603.05969

Authors: Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visually similar images, captioning generates descriptions, Change captioning generates, generates descriptions, descriptions that explicitly

Notes: Accepted to ICLR 2026. Code and models are available at [this https URL](https://github.com/BlueberryOreo/ProCap)

Abstract:Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is the key to understanding not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design: The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. Then the encoder learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage -- a process incurring computational overhead and sensitivity to visual noise -- we introduce learnable procedure queries to prompt the encoder for inferring the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at this https URL.

33. 【2603.05953】Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

Link: https://arxiv.org/abs/2603.05953

Authors: Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz, Ryan L. Boyd

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: dynamic process shaped, process shaped, interplay between individual, individual dispositions, situational contexts

Abstract:Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.

34. 【2603.05933】Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling

Link: https://arxiv.org/abs/2603.05933

Authors: Chanhui Zhu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, small Language Models, Large Language, demonstrated impressive capabilities, small Language

Notes: 26 pages, 4 figures. Preprint

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, modeling highly stylized personas with small Language Models (SLMs) remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to "Out-Of-Character" (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model's latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.

35. 【2603.05928】Addressing the Ecological Fallacy in Larger LMs with Human Context

Link: https://arxiv.org/abs/2603.05928

Authors: Nikita Soni, Dhruv Vijay Kunjadiya, Pratham Piyush Shah, Dikshya Mohanty, H. Andrew Schwartz, Niranjan Balasubramanian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: fundamental linguistic fact, linguistic fact, ecological fallacy, inference ignore, ignore a fundamental

Abstract:Language model training and inference ignore a fundamental linguistic fact -- there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author's language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author's language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.

36. 【2603.05923】Learning Next Action Predictors from Human-Computer Interaction

Link: https://arxiv.org/abs/2603.05923

Authors: Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: proactive AI systems, user, LongNAP, data, traces

Notes: 32 pages, 10 figures, see [this https URL](https://generalusermodels.github.io/nap)

Abstract:Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts -- it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held-out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score ≥ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.

37. 【2603.05909】InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning

Link: https://arxiv.org/abs/2603.05909

Authors: Maksym Taranukhin, Shuyue Stella Li, Evangelos Milios, Geoff Pleiss, Yulia Tsvetkov, Vered Shwartz

Categories: Computation and Language (cs.CL)

Keywords: generates a prediction, increasingly deployed, deployed in high-stakes, document-grounded QA systems, LLM generates

Notes: Under review

Abstract:LLMs are increasingly deployed in high-stakes domains such as medical triage and legal assistance, often as document-grounded QA systems in which a user provides a description, relevant sources are retrieved, and an LLM generates a prediction. In practice, initial user queries are often underspecified, and a single retrieval pass is insufficient for reliable decision-making, leading to incorrect and overly confident answers. While follow-up questioning can elicit missing information, existing methods typically depend on implicit, unstructured confidence signals from the LLM, making it difficult to determine what remains unknown, what information matters most, and when to stop asking questions. We propose InfoGatherer, a framework that gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to the user. InfoGatherer models uncertainty using Dempster-Shafer belief assignments over a structured evidential network, enabling principled fusion of incomplete and potentially contradictory evidence from both sources without prematurely collapsing to a definitive answer. Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns. By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical.
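The abstract does not spell out how the belief assignments are fused. As a rough illustration only, the sketch below implements the textbook Dempster's rule of combination for two mass functions; it is not the authors' implementation, and the two-hypothesis frame, the source names, and the mass values are all hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            # Agreeing evidence supports the intersection of the two sets.
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical frame with two outcomes; EITHER carries the unassigned mass.
URGENT = frozenset({"urgent"})
ROUTINE = frozenset({"routine"})
EITHER = URGENT | ROUTINE

from_documents = {URGENT: 0.6, EITHER: 0.4}   # illustrative values
from_user = {ROUTINE: 0.3, EITHER: 0.7}       # illustrative values
fused = combine(from_documents, from_user)
```

Mass left on `EITHER` after fusion represents what is still unknown, which is the kind of signal a system could use to decide whether to keep asking questions.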

38. 【2603.05895】Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

Link: https://arxiv.org/abs/2603.05895

Authors: Hussein Ghaly

Categories: Computation and Language (cs.CL)

Keywords: Security Council resolutions, Security Council, Council resolutions, efficient semantic tagging, semantic tagging

Abstract:This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9% - Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.

39. 【2603.05890】Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Link: https://arxiv.org/abs/2603.05890

Authors: Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, storyteller forgets, Language Models, Large Language, consistency

Abstract:What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at this https URL.

40. 【2603.05883】VerChol -- Grammar-First Tokenization for Agglutinative Languages

Link: https://arxiv.org/abs/2603.05883

Authors: Prabhu Raja

Categories: Computation and Language (cs.CL)

Keywords: Byte Pair Encoding, inherently script agnostic, dominant approach Byte, large language model, approach Byte Pair

Notes: 13 pages. A Morphological Alternative to Statistical Subword Tokenization

Abstract:Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions into one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.

41. 【2603.05881】Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Link: https://arxiv.org/abs/2603.05881

Authors: Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian

Categories: Computation and Language (cs.CL)

Keywords: Reliable deployment, accurate uncertainty estimation, requires accurate uncertainty, large language models, requires accurate

Abstract:Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.

42. 【2603.05878】ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

Link: https://arxiv.org/abs/2603.05878

Authors: Mingluo Su, Huan Wang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, potentially leading, deployment and inference, widely recognized, reducing the parameters

Notes: CPAL 2026 oral

Abstract:Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at this https URL.

43. 【2603.05863】ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Link: https://arxiv.org/abs/2603.05863

Authors: Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: Large Language Models, single forward pass, complex algorithmic tasks, Large Language, generating solutions

Abstract:While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at this https URL.

44. 【2603.05829】Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Link: https://arxiv.org/abs/2603.05829

Authors: Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Changran Hu, Qizheng Zhang, Urmish Thakker

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: enables large language, updating model parameters, inference without updating, adaptation enables large, large language models

Abstract:Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.

45. 【2603.05828】HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models

Link: https://arxiv.org/abs/2603.05828

Authors: Shize Liang, Hongzhi Wang

Categories: Computation and Language (cs.CL)

Keywords: knowledge-intensive question answering, demonstrated remarkable performance, Large language models, question answering, demonstrated remarkable

Abstract:Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.

46. 【2603.05818】RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

Link: https://arxiv.org/abs/2603.05818

Authors: Yuhang Liu, Ruijie Wang, Yunlong Chu, Bing Hao, Yumeng Lin, Shengzhong Liu, Minglai Shao

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, improve system-level returns, consistently improve system-level, Language Models

Abstract:Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average 8.1 percentage-point accuracy improvement and a 79.1% output token reduction compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.

47. 【2603.05786】Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Link: https://arxiv.org/abs/2603.05786

Authors: Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: online services, falsely advertised, widely deployed, deployed as online, measures are falsely

Notes: 8 pages

Abstract:As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: this https URL

48. 【2603.05778】Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring

Link: https://arxiv.org/abs/2603.05778

Authors: Zhuqian Zhou, Kirk Vanacore, Tamisha Thompson, Jennifer St John, Rene Kizilcec

Categories: Computation and Language (cs.CL)

Keywords: effective requires methods, Understanding what makes, systematically analyzing tutors', National Tutoring Observatory, makes tutoring effective

Abstract:Understanding what makes tutoring effective requires methods for systematically analyzing tutors' instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors' instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.

49. 【2603.05776】PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

Link: https://arxiv.org/abs/2603.05776

Authors: Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: patients' lived experiences, strongly influence adherence, care coordination, lived experiences, including factors

Abstract:Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP

Submission history: v1 submitted by Samah Fodeh on Fri, 6 Mar 2026 00:16:05 UTC (955 KB)

50. 【2603.05750】NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Link: https://arxiv.org/abs/2603.05750

Authors: Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

Categories: Computation and Language (cs.CL)

Keywords: Existing scholarly information, Existing scholarly, scholarly information extraction, overlook implementation-level details, information extraction difficult

Notes: To be published (Accepted at WWW'26)

Abstract:Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.

51. 【2603.05744】CodeScout: Contextual Problem Statement Enhancement for Software Agents

Link: https://arxiv.org/abs/2603.05744

Authors: Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Current AI-powered code, Current AI-powered, lack sufficient task, requirements specification, tools often struggle

Abstract:Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.

52. 【2603.05743】Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

Link: https://arxiv.org/abs/2603.05743

Authors: Samantha Adorno, Akshata Kishore Moharir, Ratna Kandala

Categories: Computation and Language (cs.CL)

Keywords: remains predominantly text-first, underserving primarily oral, artificial intelligence, universal solutions, predominantly text-first

Abstract:Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. As an alternative to the standard "text-to-speech" pipeline, we propose an oral-first multi-agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.

53. 【2603.05727】Structured Multidimensional Representation Learning for Large Language Models

Link: https://arxiv.org/abs/2603.05727

Authors: Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois

Categories: Computation and Language (cs.CL); Numerical Analysis (math.NA)

Keywords: language processing tasks, natural language processing, Transformer architectures achieve, substantial parameter growth, Tensor Transformer architecture

Notes: 25 pages, 6 figures. Preprint of a journal submission

Abstract:Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a fourfold encoder reduction; at BERT-base width (d=768), performance returns to parity.
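The roughly 1/p parameter count can be checked with simple arithmetic. The sketch below is a back-of-the-envelope count and not the paper's model: it assumes a standard encoder layer with four d x d attention projections and a feed-forward block of hidden width 4d, and it ignores biases and normalization parameters (the lower-order terms the abstract mentions). The function names are hypothetical.

```python
def encoder_layer_weights(d, ffn_mult=4):
    """Weight count of one standard encoder layer, biases and norms ignored.

    Four d x d attention projections (query, key, value, output) plus two
    feed-forward matrices of shape d x (ffn_mult * d).
    """
    return 4 * d * d + 2 * ffn_mult * d * d


def split_layer_weights(d, p, ffn_mult=4):
    """Weight count of p independent layers, each at width d / p."""
    if d % p != 0:
        raise ValueError("d must be divisible by p")
    return p * encoder_layer_weights(d // p, ffn_mult)


d, p = 768, 4                       # BERT-base width and the p used in the abstract
dense = encoder_layer_weights(d)    # 12 * d^2
split = split_layer_weights(d, p)   # p * 12 * (d / p)^2 = 12 * d^2 / p
reduction = 1 - split / dense
```

Because every weight matrix scales with the square of the width, splitting into p sub-transformers divides the count by p, which for p = 4 gives the 75% reduction quoted in the abstract.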

54. 【2603.05723】Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach

Link: https://arxiv.org/abs/2603.05723

Authors: Erin van Liemt, Renee Shelby, Andrew Smart, Sinchana Kumbale, Richard Zhang, Neha Dixit, Qazi Mamunur Rashid, Jamila Smith-Loud

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: lack of empirical, empirical evidence, global attitudes, represent cultures, GenAI

Notes: 21 pages, 5 figures, 6 tables

Abstract:There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. Our survey covers parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural "redlines".

55. 【2603.05698】Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

Link: https://arxiv.org/abs/2603.05698

Authors: Hazem Amamou, Stéphane Gagnon, Alan Davoust, Anderson R. Avila

Subjects: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, capabilities of Large, encoded prior knowledge

Comments: The paper is 6 pages long and includes 5 figures and 3 tables illustrating the experimental framework and results. It is submitted to the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2025) and studies improving the robustness of Retrieval-Augmented Generation systems using knowledge graph based GraphRAG approaches

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge graph based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real world scenarios.

56. 【2603.05696】Autonomous Algorithm Discovery for Ptychography via Evolutionary LLM Reasoning

Link: https://arxiv.org/abs/2603.05696

Authors: Xiangyu Yin, Ming Du, Junjing Deng, Zhi Yang, Yimo Han, Yi Jiang

Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA)

Keywords: high-resolution materials characterization, remain manually designed, largely remain manually, computational imaging technique, imaging technique widely

Comments:

Click to view abstract

Abstract:Ptychography is a computational imaging technique widely used for high-resolution materials characterization, but high-quality reconstructions often require the use of regularization functions that largely remain manually designed. We introduce Ptychi-Evolve, an autonomous framework that uses large language models (LLMs) to discover and evolve novel regularization algorithms. The framework combines LLM-driven code generation with evolutionary mechanisms, including semantically-guided crossover and mutation. Experiments on three challenging datasets (X-ray integrated circuits, low-dose electron microscopy of apoferritin, and multislice imaging with crosstalk artifacts) demonstrate that discovered regularizers outperform conventional reconstructions, achieving up to +0.26 SSIM and +8.3 dB PSNR improvements. In addition, Ptychi-Evolve records algorithm lineage and evolution metadata, enabling interpretable and reproducible analysis of discovered regularizers.

57. 【2603.05690】FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

Link: https://arxiv.org/abs/2603.05690

Authors: Hung Nguyen Huy, Mo El-Haj, Dawn Knight, Paul Rayson

Subjects: Computation and Language (cs.CL)

Keywords: open source web, source web based, English text collections, analysing bilingual Vietnamese, open source

Comments: 10 pages

Click to view abstract

Abstract:FreeTxt-Vi is a free and open-source web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus analysis features such as concordancing, keyword analysis, word relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text analysis platforms, FreeTxt-Vi is evaluated as a set of language processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.

58. 【2603.05651】The Fragility Of Moral Judgment In Large Language Models

Link: https://arxiv.org/abs/2603.05651

Authors: Tom van Nuenen, Pratik S. Sachdeva

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: interrogate missing context, People increasingly, LLM moral judgments, LLM moral, interpersonal guidance

Comments: 22 pages, 7 figures, 10 tables, plus appendices

Click to view abstract

Abstract:People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
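
The two headline statistics in this abstract, flip rate and Cohen's kappa, are simple to compute. The sketch below uses made-up verdict labels purely for illustration; the label names and the toy data are assumptions, not the paper's data or code.

```python
from collections import Counter


def flip_rate(original, perturbed):
    """Fraction of items whose verdict changes after a perturbation."""
    if len(original) != len(perturbed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a != b for a, b in zip(original, perturbed)) / len(original)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts for eight dilemmas, before and after a point-of-view shift.
base = ["YTA", "NTA", "NTA", "NAH", "YTA", "NTA", "ESH", "NTA"]
shifted = ["YTA", "NTA", "YTA", "NTA", "YTA", "NTA", "ESH", "NAH"]

rate = flip_rate(base, shifted)
kappa = cohens_kappa(base, shifted)
print(f"flip rate = {rate:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement because it discounts matches that would occur by chance given each labeling's label frequencies, which is why the abstract reports both 67.6% agreement and kappa=0.55.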

59. 【2603.05621】RACAS: Controlling Diverse Robots With a Single Agentic System

Link: https://arxiv.org/abs/2603.05621

Authors: Dylan R. Ashley, Jan Przepióra, Yimeng Chen, Ali Abualsaud, Nurzhan Yesmagambet, Shinkyu Park, Eric Feron, Jürgen Schmidhuber

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: expose an API, read their sensors, external software, software can command, command their actuators

Comments: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text

Click to view abstract

Abstract:Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.

60. 【2603.05618】Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

Link: https://arxiv.org/abs/2603.05618

Authors: Patrick Ahrend, Tobias Eder, Xiyang Yang, Zhiyi Pan, Georg Groh

Subjects: Computation and Language (cs.CL)

Keywords: prompting improves LLM, personally identifiable information, improves LLM reasoning, resurfacing personally identifiable, improves LLM

Comments:

Click to view abstract

Abstract:Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
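
To make the idea of risk-weighted leakage with a rule-based detector concrete, here is a minimal sketch. The two PII types, the regular expressions, and the risk weights are all assumptions chosen for illustration; the paper's 11-type taxonomy and its scoring are not reproduced here.

```python
import re

# Hypothetical PII patterns and risk weights (illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
RISK_WEIGHTS = {"email": 1.0, "ssn": 3.0}


def find_pii(text):
    """Return the set of (pii_type, value) pairs matched by the rule-based patterns."""
    return {
        (pii_type, match.group(0))
        for pii_type, pattern in PII_PATTERNS.items()
        for match in pattern.finditer(text)
    }


def leakage_score(prompt, trace):
    """Sum the risk weights of prompt PII values that reappear in the reasoning trace."""
    leaked = find_pii(prompt) & find_pii(trace)
    return sum(RISK_WEIGHTS[pii_type] for pii_type, _ in leaked)


prompt = "Contact Jane at jane.doe@example.com, SSN 123-45-6789, about her refund."
trace = "The user jane.doe@example.com wants a refund; I should not repeat the SSN."
score = leakage_score(prompt, trace)
print(score)
```

Here only the email resurfaces in the trace, so the score equals the email weight. A pattern-based detector like this misses paraphrased or reformatted PII, which is one reason the abstract also benchmarks learned detectors.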

61. 【2603.05617】NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution

Link: https://arxiv.org/abs/2603.05617

Authors: Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah

Subjects: Computation and Language (cs.CL)

Keywords: Conditional Probability Curvature, machine-generated text detection, integrating curvature-based signals, including Conditional Probability

Comments: 8 pages, 7 figures

Click to view abstract

Abstract:We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.

62. 【2603.05569】CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain

Link: https://arxiv.org/abs/2603.05569

Authors: Hung Nguyen, Hans Moen, Pekka Marttinen

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Electronic Health Record, requires SQL expertise, Health Record, Electronic Health, Large Language Models

Comments:

Click to view abstract

Abstract:Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, creating a barrier for healthcare decision-making and research. While a promising approach is to use Large Language Models (LLMs) to translate natural language questions to SQL via Retrieval-Augmented Generation (RAG), adapting this approach to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the task demonstration pool to improve coverage, which in turn introduces noise and scalability problems. To address this, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and utilizes a two-stage retrieval process that first captures logical structure and then resolves relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.

63. 【2603.05566】Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment

Link: https://arxiv.org/abs/2603.05566

Authors: Xiang Ma, Lexin Fang, Litian Xu, Caiming Zhang

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: multimodal learning aimed, achieving semantic consistency, vision and language, crucial task, task in multimodal

Comments: AAAI 2026 poster

Click to view abstract

Abstract:Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet is introduced to adaptively decouple the embeddings, applying multiple constraints to ensure effective separation. (2) A distribution sampling method is proposed to bridge the modality gap, ensuring the rationality of the alignment process. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of CDDS, outperforming state-of-the-art methods by 6.6% to 14.2%.

64. 【2603.05553】EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

Link: https://arxiv.org/abs/2603.05553

Authors: Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei-Chen Wang, Hanrui Wang, Di Jin

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: domain-specific training data, training data spanning, large language models, data spanning executable, spanning executable environments

Comments:

Click to view abstract

Abstract:Function-calling agents -- large language models that invoke tools and APIs -- require high-quality, domain-specific training data spanning executable environments, backing databases, and diverse multi-turn trajectories. We introduce EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent architecture. A top-level orchestrator, EigenCore, coordinates three specialized sub-systems: DatabaseAgent for realistic domain database construction, CodingAgent for verified executable environment generation with iterative test-debug loops, and DataAgent for multi-turn trajectory synthesis with self-evolving prompt optimization. Cross-component feedback ensures consistency across all artifacts. We apply EigenData to audit and repair the Berkeley Function-Calling Leaderboard (BFCL-V3), identifying systematic errors in function schemas, implementations, and reference trajectories, automatically correcting them through coordinated schema refinement, code-level bug fixes, and trajectory modification, and introducing an outcome-aware evaluation protocol that assesses task success via database-state correctness rather than turn-level trajectory matching. We demonstrate that the repaired benchmark, coupled with outcome-aware metrics, produces model rankings substantially better correlated with human judgments of functional correctness.
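
The difference between turn-level trajectory matching and outcome-aware evaluation can be shown with a toy example. In the sketch below the "database" is a plain dictionary and each tool call is a hypothetical (field, value) update; none of this reflects EigenData's or BFCL-V3's actual schemas.

```python
def run(initial_state, calls):
    """Apply a sequence of (field, value) updates and return the final state."""
    state = dict(initial_state)  # copy so the initial state is not mutated
    for field, value in calls:
        state[field] = value
    return state


def trajectory_match(predicted, reference):
    """Strict check: the calls must be identical and in the same order."""
    return list(predicted) == list(reference)


def outcome_match(initial_state, predicted, reference):
    """Outcome-aware check: only the final database state has to agree."""
    return run(initial_state, predicted) == run(initial_state, reference)


initial = {"seat": None, "meal": None}
reference = [("seat", "12A"), ("meal", "veg")]
predicted = [("meal", "veg"), ("seat", "12A")]  # same effect, different order
wrong = [("seat", "14C"), ("meal", "veg")]      # different final state

print(trajectory_match(predicted, reference))        # False
print(outcome_match(initial, predicted, reference))  # True
print(outcome_match(initial, wrong, reference))      # False
```

A reordered but functionally correct trajectory fails the strict check and passes the state-based one, which is the kind of case the abstract's outcome-aware protocol is meant to score correctly.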

65. 【2603.05540】Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding

Link: https://arxiv.org/abs/2603.05540

Authors: Faruk Alpay, Bilge Senturk

Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)

Keywords: autoregressive next-token distribution, pushdown system compiled, study grammar-constrained decoding, study grammar-constrained, pushdown system

Comments: 20 pages

Click to view abstract

Abstract:We study grammar-constrained decoding (GCD) as a coupling between an autoregressive next-token distribution and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). We prove an oracle invariance theorem: language-equivalent grammars induce identical admissible next-token sets for every prefix, hence identical logit masks, yet can yield provably different compiled state spaces and online ambiguity costs. We give exact control-state blowup counts for the canonical $a^n b^n$ language under redundant nonterminal delegation, and introduce a left-to-right structural ambiguity cost (SAC) measuring incremental packed-parse-forest growth per token. For two equivalent grammars over all finite strings, SAC is $O(1)$ per token under right-recursion but $\Theta(t^2)$ per token and $\Theta(n^3)$ cumulatively under concatenation. We establish engine-independent lower bounds: any sound, retrieval-efficient, parse-preserving online masking engine must incur $\Omega(t^2)$ work per token on a specific constant-size CFG family, unconditionally within this model. We define decoding-cost equivalence classes of grammars and prove existence of minimal-SAC representatives within bounded rewrite families. Finally, we characterize the true conditional sampler via a Doob $h$-transform and derive sharp one-step KL and total-variation distortion bounds for hard-masked decoding in terms of survival-probability spread among admissible next tokens. We integrate these results with Transformer and Mixture-of-Experts architectures, derive latency envelopes in terms of vocabulary size, active state sets, and beam width, and connect SAC to instrumentation-based predictive performance models and automated grammar optimization.
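
As a small illustration of hard-masked decoding for the $a^n b^n$ language mentioned in this abstract, the sketch below hand-codes a reachability oracle for $\{a^n b^n : n \ge 1\}$ and uses it to mask logits. The three-token vocabulary and the function names are assumptions; the paper compiles the oracle from a CFG rather than writing it by hand.

```python
import math

VOCAB = ["a", "b", "<eos>"]  # toy vocabulary, one token per symbol


def admissible_next(prefix: str) -> set:
    """Tokens after which the prefix can still be completed to some a^n b^n, n >= 1."""
    n_a = len(prefix) - len(prefix.lstrip("a"))  # length of the leading run of a's
    rest = prefix[n_a:]
    n_b = len(rest)
    if rest != "b" * n_b or n_b > n_a:
        return set()  # dead prefix: no completion exists
    if n_b == 0:
        return {"a"} if n_a == 0 else {"a", "b"}
    return {"b"} if n_b < n_a else {"<eos>"}


def masked_distribution(logits, prefix):
    """Softmax restricted to admissible tokens (inadmissible tokens get probability 0)."""
    allowed = admissible_next(prefix)
    if not allowed:
        raise ValueError("prefix cannot be extended to a string in the language")
    weights = [math.exp(l) if tok in allowed else 0.0 for tok, l in zip(VOCAB, logits)]
    total = sum(weights)
    return [w / total for w in weights]


probs = masked_distribution([0.0, 0.0, 5.0], "aa")
print(dict(zip(VOCAB, probs)))
```

The oracle depends only on the language, not on how a grammar for it is written, which mirrors the invariance result stated in the abstract: equivalent grammars give the same mask even when their compiled state spaces differ.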

66. 【2603.05528】Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

Link: https://arxiv.org/abs/2603.05528

Authors: Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: linearly scaling complexity, Recent multimodal systems, Recent multimodal, rely on separate, linearly scaling

Comments:

Click to view abstract

Abstract:Recent multimodal systems often rely on separate expert modality encoders, which cause complexity and computational overhead to scale linearly with added modalities. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show Omni-C achieves performance comparable to expert models in unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.

67. 【2603.05519】Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection

Link: https://arxiv.org/abs/2603.05519

Authors: Dorsaf Sallami, Esma Aïmeur

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: digital age poses, democratic institutions, Large Language Models, rampant spread, digital age

Comments:

Click to view abstract

Abstract:The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.

68. 【2603.05193】Transducing Language Models

Link: https://arxiv.org/abs/2603.05193

Authors: Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira

Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

Keywords: Modern language models, Modern language, language models, downstream tasks, tasks often require

Comments:

Click to view abstract

Abstract:Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
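
The induced distribution of $f(X)$ described in this abstract can be illustrated on a tiny, fully enumerated example. The sketch below is only a brute-force pushforward over a made-up token distribution; the paper's algorithms work with finite-state transducers and real language models, where such enumeration is not feasible.

```python
from collections import defaultdict


def pushforward(source_dist, f):
    """Distribution of f(X) when X ~ source_dist: sum mass over all sources with the same image."""
    target = defaultdict(float)
    for source, prob in source_dist.items():
        target[f(source)] += prob
    return dict(target)


# Hypothetical distribution over token sequences (made-up numbers).
token_dist = {
    ("th", "e"): 0.5,
    ("t", "he"): 0.25,
    ("a",): 0.25,
}

# f concatenates tokens into a character string, so two tokenizations collapse to "the".
char_dist = pushforward(token_dist, "".join)
print(char_dist)
```

The probability of the string "the" is the sum over both tokenizations that produce it, which is the marginalization over source strings that the abstract refers to.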

69. 【2402.06204】The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Link: https://arxiv.org/abs/2402.06204

Authors: Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large Language Models, Large Language, assumption that Large, Language Models, equally adept

Comments:

Click to view abstract

Abstract:This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.

70. 【2603.06310】Continual Adaptation for Pacific Indigenous Speech Recognition

Link: https://arxiv.org/abs/2603.06310

Authors: Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: low-resource Pacific Indigenous, Speech foundation models, Speech foundation, Pacific Indigenous languages, Pacific Indigenous

Comments: Submitted to Interspeech

Click to view abstract

Abstract:Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate how data volume and linguistic features affect adaptation success. Specifically, we evaluate strategies including Full Fine-Tuning and Low-Rank Adaptation (LoRA). Additionally, we analyze a continual learning framework for sequentially acquiring multiple languages. We demonstrate that adapting to these distant languages causes severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.

Information Retrieval

1. 【2603.06397】Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion

Link: https://arxiv.org/abs/2603.06397

Authors: Pengcheng Jiang, Judith Yue Li, Moonkyung Ryu, R. Lily Hu, Kun Su, Zhong Yi Wan, Liam Hebert, Hao Peng, Jiawei Han, Dima Kuzmin, Craig Boutilier

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: modern retrieval problems, optimizes higher-order properties, retrieval

Comments:

Click to view abstract

Abstract:Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. Consequently, fan-out retrieval is often employed to generate diverse subqueries to retrieve item sets. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at inference time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across large-scale fashion and music benchmarks consisting of curated item sets, we show that R4T improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude.

Computer Vision

1. 【2603.06578】Multimodal Large Language Models as Image Classifiers

Link: https://arxiv.org/abs/2603.06578

Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, performance depends critically

Comments:

Abstract:The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported underperformance of MLLMs on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
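The output-mapping issue mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain string similarity from Python's standard `difflib` as a stand-in for whatever mapping the paper actually uses (the abstract does not specify it); `map_to_class` and `accuracy` are hypothetical helper names. The point shown is only that an answer outside the class list is mapped or counted as an error instead of being silently discarded.

```python
import difflib
from typing import List, Optional


def map_to_class(raw_output: str, class_names: List[str],
                 cutoff: float = 0.6) -> Optional[str]:
    """Map a free-form model answer to the closest class name, or None.

    String similarity is an assumption made for illustration; an
    embedding-based matcher could be swapped in here.
    """
    normalized = raw_output.strip().lower()
    lookup = {name.lower(): name for name in class_names}
    if normalized in lookup:
        return lookup[normalized]
    matches = difflib.get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


def accuracy(outputs: List[str], labels: List[str], class_names: List[str]) -> float:
    """Accuracy in which unmappable outputs count as errors, not as dropped samples."""
    if not labels:
        return 0.0
    correct = sum(
        1 for out, gold in zip(outputs, labels)
        if map_to_class(out, class_names) == gold
    )
    return correct / len(labels)
```

Keeping unmapped outputs in the denominator avoids the inflation that arises when such samples are dropped before scoring.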

2. 【2603.06577】Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Link: https://arxiv.org/abs/2603.06577

Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made impressive strides, conventional autoregressive architecture, leaving significant room, discrete diffusion models, large language models

Comments: Project page: [this https URL](https://omni-diffusion.github.io)

Abstract:While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.

3. 【2603.06576】BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

Link: https://arxiv.org/abs/2603.06576

Authors: Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Large Language Models, Language Models, Large Language, attracted growing interest, handling complex decision-making

Comments: 4 figures, 6 tables in the main paper, 32 pages in total

Abstract:The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

4. 【2603.06572】SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

Link: https://arxiv.org/abs/2603.06572

Authors: Vishal Thengane, Zhaochong An, Tianjin Huang, Son Lam Phung, Abdesselam Bouzerdoum, Lu Yin, Na Zhao, Xiatian Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Incremental Few-Shot, IFS, Incremental, aims to learn, segmentation aims

Comments: Accepted at CVPR 2026

Abstract:Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at this https URL.
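The retrieve-and-fuse step described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: retrieval is assumed to be top-k by cosine similarity and fusion is assumed to be a convex combination with weight `alpha`; `enrich_prototype`, `top_k`, and `alpha` are hypothetical names chosen for this example.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def enrich_prototype(few_shot_proto: Vector, background_pool: List[Vector],
                     top_k: int = 2, alpha: float = 0.5) -> Vector:
    """Fuse a few-shot prototype with its most similar background prototypes.

    Assumed rule: enriched = alpha * few_shot + (1 - alpha) * mean(retrieved).
    No parameters are trained; only stored prototypes are combined.
    """
    ranked = sorted(background_pool,
                    key=lambda p: cosine(few_shot_proto, p), reverse=True)
    retrieved = ranked[:top_k]
    if not retrieved:
        return list(few_shot_proto)
    dim = len(few_shot_proto)
    context = [sum(p[i] for p in retrieved) / len(retrieved) for i in range(dim)]
    return [alpha * f + (1.0 - alpha) * c for f, c in zip(few_shot_proto, context)]
```

Because the function only reads stored vectors, it is consistent with the abstract's claim that enrichment needs no backbone retraining or added parameters.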

5. 【2603.06570】SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Link: https://arxiv.org/abs/2603.06570

Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: surgical, Surgeons, SUREON, reasoning, Relative Policy Optimization

Comments:

Abstract:Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

6. 【2603.06569】Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Link: https://arxiv.org/abs/2603.06569

Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Model, Vision Language, Language Model, development has largely, smartphones and robots

Comments: Penguin-VL Technical Report; Code: [this https URL](https://github.com/tencent-ailab/Penguin-VL)

Abstract:Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL

7. 【2603.06561】EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Link: https://arxiv.org/abs/2603.06561

Authors: Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherently complex due, Egocentric video understanding, object displacements necessitate, video understanding, understanding is inherently

Comments: preprint

Abstract:Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.

8. 【2603.06544】Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

Link: https://arxiv.org/abs/2603.06544

Authors: Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Next-generation autonomous vehicles, support real-time decision-making, Next-generation autonomous, autonomous vehicles, rely on large

Comments: This paper has been accepted by the Fourth IEEE International Conference on Mobility: Operations, Services, and Technologies (MOST) 2026

Abstract:Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ improves from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: this https URL

9. 【2603.06543】SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

Link: https://arxiv.org/abs/2603.06543

Authors: Ashkan Shahbazi, Elaheh Akbari, Kyvia Pereira, Jon S. Heiselman, Annie C. Benson, Garrison L. H. Johnston, Jie Ying Wu, Nabil Simaan, Michael I. Miga, Soheil Kolouri

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiresolution gated transformer, driven soft tissue, data driven soft, soft tissue simulation, multiresolution gated

Comments:

Abstract:We introduce SurgFormer, a multiresolution gated transformer for data-driven soft tissue simulation on volumetric meshes. High-fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver-generated data to predict nodewise displacement fields at near-real-time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self-attention, and pointwise feedforward updates, fused by learned per-node, per-channel gates to adaptively integrate local and long-range information while remaining scalable on large meshes. For cut-conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology-altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM-based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM-supervised cut-conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. Code, data, and project page: this https URL

10. 【2603.06533】NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Link: https://arxiv.org/abs/2603.06533

Authors: Taewon Kang, Ming C. Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains inadequately modeled, fundamental linguistic operator, diffusion-based generative models, diffusion-based generative, diffusion-based generative systems

Comments: 50 pages, 32 figures

Abstract:Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
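The idea of projecting a guidance update onto a convex constraint set can be sketched in a few lines. The example assumes the simplest possible set, a half-space {v : <v, n> <= 0}, where `n` is a hypothetical direction standing for the negated concept; the abstract does not state the paper's actual constraint set, so `project_to_halfspace` and `guided_update` are illustrative names rather than the authors' method.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_to_halfspace(u: Vector, n: Vector) -> Vector:
    """Euclidean projection of u onto the half-space {v : <v, n> <= 0}."""
    violation = dot(u, n)
    if violation <= 0.0:
        return list(u)  # already feasible, nothing to remove
    scale = violation / dot(n, n)
    return [ui - scale * ni for ui, ni in zip(u, n)]


def guided_update(eps_uncond: Vector, eps_cond: Vector,
                  negated_dir: Vector, guidance_scale: float) -> Vector:
    """Classifier-free guidance with the semantic direction projected first.

    Standard CFG combines predictions as
    eps_uncond + w * (eps_cond - eps_uncond); here the difference term is
    projected so it has no positive component along negated_dir (assumed).
    """
    direction = [c - u for c, u in zip(eps_cond, eps_uncond)]
    direction = project_to_halfspace(direction, negated_dir)
    return [u + guidance_scale * d for u, d in zip(eps_uncond, direction)]
```

For instance, with `eps_uncond = [0.0, 0.0]`, `eps_cond = [1.0, 1.0]`, `negated_dir = [1.0, 0.0]`, and a guidance scale of 2.0, the component along the negated direction is removed and the update keeps only the second coordinate. No parameters are trained, which matches the training-free framing in the abstract.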

11. 【2603.06531】Spatial Calibration of Diffuse LiDARs

Link: https://arxiv.org/abs/2603.06531

Authors: Nikhil Behari, Ramesh Raskar

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: depth histograms formed, aggregating photon returns, wide instantaneous field, standard LiDAR-RGB calibration, report per-pixel depth

Comments:

Abstract:Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.

12. 【2603.06530】AV-Unified: A Unified Framework for Audio-visual Scene Understanding

Link: https://arxiv.org/abs/2603.06530

Authors: Guangyao Li, Xin Wang, Wenwu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: naturally integrate multiple, integrate multiple audio-visual, perceive the world, humans perceive, naturally integrate

Comments: Accepted by IEEE Transactions on Multimedia (TMM)

Abstract:When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, tasks such as event localization, parsing, segmentation, and question answering are currently mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.

13. 【2603.06523】SCAN: Visual Explanations with Self-Confidence and Analysis Networks

Link: https://arxiv.org/abs/2603.06523

Authors: Gwanghee Lee, Sungyoon Jeong, Kyoungson Jhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learning models transparent, deep learning models, essential in computer, computer vision, deep learning

Comments: 14 pages, 9 figures, IEEE Transactions on Artificial Intelligence

Abstract:Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation (XAI) methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.

14. 【2603.06522】Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

Link: https://arxiv.org/abs/2603.06522

Authors: Yuanji Zhang, Yuhao Huang, Haoran Dou, Xiliang Zhu, Chen Ling, Zhong Yang, Lianying Liang, Jiuping Li, Siying Liang, Rui Li, Yan Cao, Yuhan Zhang, Jiewei Lai, Yongsong Zhou, Hongyu Zheng, Xinru Gao, Cheng Yu, Liling Shi, Mengqin Yuan, Honglong Li, Xiaoqiong Huang, Chaoyu Chen, Jialin Zhang, Wenxiong Pan, Alejandro F. Frangi, Guangzhi He, Xin Yang, Yi Xiong, Linliang Yin, Xuedong Deng, Dong Ni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: congenital craniofacial abnormalities, common congenital craniofacial, accurate prenatal detection, prenatal detection remains, detection remains challenging

Comments: 28 pages, 10 figures, 11 tables

Abstract:Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcity of experienced specialists and the relative rarity of the condition. Early and reliable diagnosis is essential to enable timely clinical intervention and reduce associated morbidity. Here we show that an artificial intelligence system, trained on over 45,139 ultrasound images from 9,215 fetuses across 22 hospitals, can diagnose fetal orofacial clefts with sensitivity and specificity exceeding 93% and 95% respectively, matching the performance of senior radiologists and substantially outperforming junior radiologists. When used as a medical copilot, the system raises junior radiologists' sensitivity by more than 6%. Beyond direct diagnostic assistance, the system also accelerates the development of clinical expertise. A pilot study involving 24 radiologists and trainees demonstrated that the model can improve the expertise development for rare conditions. This dual-purpose approach offers a scalable solution for improving both diagnostic accuracy and specialist training in settings where experienced radiologists are scarce.

15. 【2603.06512】SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants

Link: https://arxiv.org/abs/2603.06512

Authors: Rohit Menon, Niklas Mueller-Goldingen, Sicong Pan, Gokul Krishna Chenchani, Maren Bennewitz

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense crop canopies, crop canopies requires, canopies requires effective, direction-conditioned relations identifying, Robotic harvesting

Comments:

Abstract:Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.
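For readers unfamiliar with the ranking metric quoted above, NDCG@k can be computed as in the following minimal sketch. It assumes the common linear-gain form with a log2 rank discount, and it takes the ideal ordering from the same candidate list (that is, the list is assumed to contain every relevant leaf); the abstract does not say which NDCG variant the paper uses, so this is an illustration only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant item at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

Here `ranked_relevances` lists the ground-truth relevance of each candidate leaf in the order the model ranked them, so a value of 1.0 means the occluding leaves were ranked in the ideal order.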

16. 【2603.06507】Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Link: https://arxiv.org/abs/2603.06507

Authors: Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, Robin Rombach

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Comments: project webpage: [this https URL](https://bfl.ai/research/self-flow)

None

17. 【2603.06471】Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Link: https://arxiv.org/abs/2603.06471

Authors: Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Acquiring per-frame video, deploying computer vision, Acquiring per-frame, per-frame video annotations, video annotations remains

Comments:

Abstract:Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.

18. 【2603.06467】GreenRFM: Toward a resource-efficient radiology foundation model

Link: https://arxiv.org/abs/2603.06467

Authors: Yingtai Li, Shuai Ming, Mingyue Zhao, Haoran Lai, Rongsheng Wang, Rui Zhou, Rundong Wang, Yujia Li, Wei Wei, Shaohua Kevin Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: radiology foundation models, brute-force scaling, radiology foundation, reliance on brute-force, foundation models

Comments:

Abstract:The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and of two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, on both public and private benchmarks, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.

19. 【2603.06459】Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Link: https://arxiv.org/abs/2603.06459

Authors: Yakov Pyotr Shkolnikov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-language models encode, models encode continuous, encode continuous geometry, extracts hand joint, hand joint angles

Comments:

Abstract:Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.

20. 【2603.06454】Training Flow Matching: The Role of Weighting and Parameterization

Link: https://arxiv.org/abs/2603.06454

Authors: Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: output parameterization, velocity-based formulations, focus on loss, loss weighting, weighting and output

Comments:

Abstract:We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
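
The relationship between output parameterization and loss weighting can be sketched with a few lines of arithmetic. The example below assumes the linear path x_t = (1 - t) * noise + t * clean, with target velocity v = clean - noise; this convention and all function names are illustrative assumptions, not taken from the paper. Under that assumption, a noise or clean-image prediction converts to a velocity, and an unweighted velocity loss equals a clean-image loss weighted by 1 / (1 - t)^2.

```python
# Assumed convention (not from the abstract):
#   x_t = (1 - t) * noise + t * clean,  target velocity v = clean - noise.

def interpolate(noise, clean, t):
    """Point on the straight path between noise (t=0) and clean data (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]

def velocity_from_clean(x_t, clean_pred, t):
    """v = (clean - x_t) / (1 - t); ill-conditioned as t approaches 1."""
    return [(c - x) / (1.0 - t) for x, c in zip(x_t, clean_pred)]

def velocity_from_noise(x_t, noise_pred, t):
    """v = (x_t - noise) / t; ill-conditioned as t approaches 0."""
    return [(x - n) / t for x, n in zip(x_t, noise_pred)]

def weighted_mse(pred, target, weight=1.0):
    """Mean squared error scaled by a (time-dependent) weight."""
    return weight * sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

# Toy 3-dimensional sample (hypothetical values).
noise = [0.5, -1.0, 2.0]
clean = [1.0, 0.0, -1.0]
t = 0.25
x_t = interpolate(noise, clean, t)
v_true = [c - n for n, c in zip(noise, clean)]

# Exact predictions in either parameterization recover the same velocity.
v_a = velocity_from_clean(x_t, clean, t)
v_b = velocity_from_noise(x_t, noise, t)

# An imperfect clean-image prediction: the implied velocity error is the
# clean-image error divided by (1 - t), so the two losses differ only by
# the weight 1 / (1 - t)^2.
clean_pred = [c + 0.1 for c in clean]
v_hat = velocity_from_clean(x_t, clean_pred, t)
loss_velocity = weighted_mse(v_hat, v_true)
loss_clean_weighted = weighted_mse(clean_pred, clean, weight=1.0 / (1.0 - t) ** 2)
```

In this sketch the choice of parameterization changes only the implicit weighting over t, which is one reason the two design choices are hard to study in isolation.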

21. 【2603.06453】Pinterest Canvas: Large-Scale Image Generation at Pinterest

Link: https://arxiv.org/abs/2603.06453

Authors: Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk, Charles Rosenberg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simple inference adaptation, recent image generation, rendering them unsuitable, image generation, remarkable ability

Comments:

Abstract:While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.

22. 【2603.06449】CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Link: https://arxiv.org/abs/2603.06449

Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision remains non-trivial, language models rely, remains non-trivial, extending this paradigm, Autoregressive

Comments: Project website is available at [this https URL](https://sharelab-sii.github.io/catok-web)

Abstract:Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.

23. 【2603.06445】What if? Emulative Simulation with World Models for Situated Reasoning

Link: https://arxiv.org/abs/2603.06445

Authors: Ruiping Liu, Yufan Chen, Yuheng Zhang, Junwei Zheng, Kunyu Peng, Chengzhi Wu, Chenguang Huang, Di Wen, Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visually impaired users, impaired users, infeasible due, due to physical, physical constraints

Comments:

Abstract:Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.

24. 【2603.06426】CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation

Link: https://arxiv.org/abs/2603.06426

Authors: Parhom Esmaeili, Chayanin Tangwiriyasakul, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Interactive segmentation enables, segmentation enables clinicians, Interactive segmentation, consistently reach expert-level, enables clinicians

Comments: 10 pages, 2 figures

Abstract:Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Because annotation campaigns produce a growing stream of task-specific labelled data, online adaptation of the segmentation model is a natural complement to zero-shot inference. We propose CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive's parameters on the annotation cache, triggered by lightweight episode scheduling. CLoPA requires no new parameters or changes to the inference pipeline, and operates entirely within the existing annotation workflow. Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed, with the majority of gains realised after a single training episode. We show that the benefits of tuning different parameter groups depend on task characteristics and data regimes. We also show that, for targets with complex geometries (e.g., hepatic vessels), instance normalisation and low-level feature tuning saturate, suggesting a need for deeper feature-representation alignment in the most challenging scenarios.

25. 【2603.06421】Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision

Link: https://arxiv.org/abs/2603.06421

Authors: Clemens Seibold, Anna Hilsmann, Peter Eisert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: behavior provides relevant, relevant information, health in aquaculture, fish, Monitoring

Comments: Accepted at VISAPP 2026

Abstract:Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
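
To make the refraction problem concrete, here is a minimal sketch of tracing a camera ray through flat air-glass-water interfaces with the vector form of Snell's law. It is a generic illustration, not the authors' method: the function names, the refractive indices (approximate textbook values), and the idea of measuring length as a sum of segment distances between 3D keypoints are all assumptions.

```python
import math

def refract(direction, normal, n_in, n_out):
    """Refract a unit ray at a flat interface (vector form of Snell's law).

    `normal` is a unit vector pointing back toward the incoming side.
    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n_in / n_out
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection, no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return [eta * d + (eta * cos_i - cos_t) * n for d, n in zip(direction, normal)]

def polyline_length(points):
    """Sum of Euclidean distances between consecutive 3D keypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Approximate refractive indices (assumed values).
N_AIR, N_GLASS, N_WATER = 1.0, 1.5, 1.33

# Camera looks along +z; the glass pane is perpendicular to z.
normal = [0.0, 0.0, -1.0]
angle = math.radians(30.0)
ray_air = [math.sin(angle), 0.0, math.cos(angle)]

ray_glass = refract(ray_air, normal, N_AIR, N_GLASS)
ray_water = refract(ray_glass, normal, N_GLASS, N_WATER)

# Hypothetical triangulated keypoints (head, mid-body, tail) in metres.
keypoints = [[0.00, 0.0, 0.30], [0.01, 0.0, 0.30], [0.02, 0.0, 0.30]]
length_m = polyline_length(keypoints)
```

For parallel flat interfaces the glass index cancels out of the final ray direction, but the pane still shifts the ray sideways, which is why a pinhole model without refraction gives biased 3D points.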

26. 【2603.06408】Physical Simulator In-the-Loop Video Generation

Link: https://arxiv.org/abs/2603.06408

Authors: Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, Christian Theobalt

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: Recent advances, obey basic physical, basic physical laws, diffusion-based video generation, achieved remarkable visual

Comments: Accepted to CVPR 2026

Abstract:Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: this https URL

27. 【2603.06407】Locating and Editing Figure-Ground Organization in Vision Transformers

Link: https://arxiv.org/abs/2603.06407

Authors: Stefan Arnold, René Gröbner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: global organizational priors, characteristic perceptual ambiguity, Vision Transformers, local geometric evidence, giving rise

Comments:

Abstract:Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model's discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.

28. 【2603.06399】DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning

Link: https://arxiv.org/abs/2603.06399

Authors: Basudha Pal, Rama Chellappa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale annotated datasets, Facial attribute classification, relies on large-scale, large-scale annotated, inherently ambiguous

Comments:

Abstract:Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence--guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.

29. 【2603.06389】Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments

Link: https://arxiv.org/abs/2603.06389

Authors: Omidreza Safaei, Sinem Aslan, Sebastiano Vascon, Luca Palmieri, Marina Khoroshiltseva, Marcello Pelillo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reassembling real-world archaeological, fragmented pieces poses, pieces poses significant, poses significant challenges, significant challenges due

Comments: 6 pages, 3 figures. Presented at the 2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP). This is the author-accepted version of the paper. The final version is available via IEEE Xplore: [this https URL](https://doi.org/10.1109/MLSP62443.2025.11204324)

Abstract:Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
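
How locking a verified placement steers a relaxation-labeling solver can be illustrated with a toy example. The sketch below uses a generic relaxation-labeling update with non-negative compatibilities; the compatibility values, the data layout, and the `locked` mechanism are hypothetical and are not taken from the paper, whose actual solver and interaction strategies are not specified in the abstract.

```python
def relaxation_step(p, r, locked):
    """One relaxation-labeling update with user-locked assignments.

    p[i][a]       : probability that fragment i takes placement a
    r[i][j][a][b] : non-negative compatibility of (i -> a) with (j -> b)
    locked        : {fragment index: placement index} fixed by the user
    """
    n, m = len(p), len(p[0])
    # Apply locks first so they influence every other fragment immediately.
    p = [row[:] for row in p]
    for i, a_fixed in locked.items():
        p[i] = [1.0 if a == a_fixed else 0.0 for a in range(m)]
    new_p = []
    for i in range(n):
        if i in locked:
            new_p.append(p[i])
            continue
        # Support for each placement from all other fragments.
        support = [
            sum(r[i][j][a][b] * p[j][b]
                for j in range(n) if j != i for b in range(m))
            for a in range(m)
        ]
        scores = [p[i][a] * support[a] for a in range(m)]
        total = sum(scores)
        new_p.append([s / total for s in scores] if total > 0 else p[i])
    return new_p

# Toy puzzle: two fragments, two slots; fragments prefer different slots.
r = [[[[1.0 if a != b else 0.1 for b in range(2)] for a in range(2)]
      for j in range(2)] for i in range(2)]
uniform = [[0.5, 0.5], [0.5, 0.5]]

# Without guidance the symmetric ambiguity never resolves.
unguided = uniform
for _ in range(5):
    unguided = relaxation_step(unguided, r, locked={})

# Locking fragment 0 into slot 1 pushes fragment 1 toward slot 0.
guided = uniform
for _ in range(5):
    guided = relaxation_step(guided, r, locked={0: 1})
```

In this toy setting a single locked placement breaks a symmetry the automatic update cannot resolve on its own, which is the intuition behind anchoring verified pieces.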

30. 【2603.06386】REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

Link: https://arxiv.org/abs/2603.06386

Authors: Maëlic Neau, Zoe Falomir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene Graph Generation, encodes visual relationships, Graph Generation, Scene Graph, encodes visual

Comments:

Abstract:Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at this https URL.

31. 【2603.06384】Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation

Link: https://arxiv.org/abs/2603.06384

Authors: Yonghuang Wu, Zhenyang Liang, Wenwen Zeng, Xuan Xie, Jinhua Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Foundation models, remain highly sensitive, enable flexible text-guided, Segment Anything Model, flexible text-guided medical

Comments:

Abstract:Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into prompt groups sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.

32. 【2603.06382】CHMv2: Improvements in Global Canopy Height Mapping using DINOv3

Link: https://arxiv.org/abs/2603.06382

Authors: John Brandt, Seungeun Yi, Jamie Tolan, Xinyuan Li, Peter Potapov, Jessica Ertel, Justine Spore, Huy V. Vo, Michaël Ramamonjisoa, Patrick Labatut, Piotr Bojanowski, Camille Couprie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate canopy height, airborne laser scanning, assessing habitat structure, quantifying forest carbon, canopy height information

Comments: Submitted to Nature Scientific Data

Abstract:Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.

33. 【2603.06378】MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis

Link: https://arxiv.org/abs/2603.06378

Authors: Dongqing Xie, Yonghuang Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Whole-slide image, inherent hierarchical multi-resolution, State Space Models, challenging due, gigapixel scale

Comments: 15 pages, 6 figures, 6 tables

Abstract:Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains open. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.

34. 【2603.06374】Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation

Link: https://arxiv.org/abs/2603.06374

Authors: Jonas Ernst, Wolfgang Boettcher, Lukas Hoyer, Jan Eric Lenssen, Bernt Schiele

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly improve weakly, improve weakly supervised, leverages recent advances, weakly supervised semantic, recent advances

Comments:

Abstract:We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.

35. 【2603.06366】OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

Link: https://arxiv.org/abs/2603.06366

Authors: Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: bilateral symmetry understanding, require fine-grained spatial, radiographs require fine-grained, multi-step diagnostic verification, existing vision-language models

Comments: 34 pages, 24 figures, conference

Abstract:Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.

36. 【2603.06362】Computer vision-based estimation of invertebrate biomass

Link: https://arxiv.org/abs/2603.06362

Authors: Mikko Impiö, Philipp M. Rehsen, Jarrett Blair, Cecilie Mielec, Arne J. Beermann, Florian Leese, Toke T. Høye, Jenni Raitoharju

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantitative biodiversity monitoring, biodiversity monitoring efforts, estimate invertebrate biomass, invertebrate biomass, scaling up quantitative

Comments:

Abstract:The ability to estimate invertebrate biomass using only images could help scale up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to replace the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
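
The point about percentage versus absolute errors can be seen on a handful of made-up numbers. The sketch below is only an illustration: the dry-mass values are hypothetical and the helper names are not from the paper.

```python
from statistics import median

def percentage_errors(true_mg, pred_mg):
    """Absolute error relative to the true mass, in percent."""
    return [100.0 * abs(p - t) / t for t, p in zip(true_mg, pred_mg)]

def absolute_errors(true_mg, pred_mg):
    """Absolute error in the same unit as the masses (mg here)."""
    return [abs(p - t) for t, p in zip(true_mg, pred_mg)]

# Hypothetical dry masses in milligrams, spanning small and large specimens.
true_mg = [0.2, 0.5, 1.0, 10.0, 50.0]
pred_mg = [0.3, 0.4, 1.1, 9.0, 45.0]

pct = percentage_errors(true_mg, pred_mg)
abs_err = absolute_errors(true_mg, pred_mg)

median_pct = median(pct)
median_abs = median(abs_err)

# Which specimen looks worst depends on the metric.
worst_by_pct = pct.index(max(pct))          # the smallest specimen
worst_by_abs = abs_err.index(max(abs_err))  # the largest specimen
```

Here the smallest specimen has the largest percentage error while the largest specimen has the largest absolute error, so a total-biomass estimate and a per-individual estimate would be judged differently by the two metrics.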

37. 【2603.06357】LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents

Link: https://arxiv.org/abs/2603.06357

Authors: Tianhao Zhao, Youjia Zhang, Hang Long, Jinshen Zhang, Wenbing Li, Yang Yang, Gongbo Zhang, Jozef Hladký, Matthias Nießner, Wei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: topology-preserving latent representation, Vertex Displacement Field, flow matching-based synthesis, voxel Variational Autoencoder, enables scalable

Comments:

点击查看摘要

Abstract: In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on the surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decode the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. Finally, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry and well-formed topology while remaining highly efficient at inference.

38. 【2603.06351】Dynamic Chunking Diffusion Transformer

Link: https://arxiv.org/abs/2603.06351

Authors: Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Transformers process images, Diffusion Transformers process, Chunking Diffusion Transformer, Transformers process, Diffusion Transformers

Abstract: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across 4× and 16× compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.

39. 【2603.06340】K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging

Link: https://arxiv.org/abs/2603.06340

Authors: Jiajun Zeng, Shadi Albarqouni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large-scale biomedical vision-language, biomedical vision-language models, Large-scale biomedical, frontline low-end modalities, vision-language models

Abstract:Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.

40. 【2603.06331】WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Link: https://arxiv.org/abs/2603.06331

Authors: Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based world models, shown strong potential, iterative denoising remains, unified world simulation, Diffusion-based world

Abstract: Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: token heterogeneity from multi-modal coupling and spatial variation, and non-uniform temporal dynamics where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose WorldCache, a caching framework tailored to diffusion world models. We introduce Curvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design Chaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedups while maintaining 98% rollout quality, demonstrating its practicality in resource-constrained scenarios. The code has been released.

41. 【2603.06324】The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

Link: https://arxiv.org/abs/2603.06324

Authors: Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: study explores artificial, images intentionally pastiching, artificial visual creativity, explores artificial visual, intentionally pastiching original

42. 【2603.06321】P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning

Link: https://arxiv.org/abs/2603.06321

Authors: Lixin Zhan, Jie Jiang, Tianjian Zhou, Yukun Du, Yan Zheng, Xuehu Duan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scenes heavily rely, Current semantic segmentation, cloud scenes heavily, raw point clouds, point cloud scenes

Abstract: Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose Consistent Structure Learning, which establishes structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose Semantic Relation Consistent Reasoning, which constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process preserves semantic consistency by imposing constraints on the consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance among unsupervised methods. Specifically, it achieves an mIoU of 47.1% on Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.

43. 【2603.06313】WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

Link: https://arxiv.org/abs/2603.06313

Authors: Peng Chen, Chao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently shown strong, shown strong generalization, zero-shot anomaly detection, task-specific supervision, Vision-language models

Abstract:Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
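
To make "multi-frequency image features" concrete, here is a minimal sketch of one level of a 2D wavelet decomposition. The abstract does not say which wavelet is used; the Haar wavelet and the function name `haar2d_level` are assumptions chosen for simplicity.

```python
def haar2d_level(img):
    """One level of an orthonormal 2D Haar transform.

    `img` is a list of rows with even height and width. Returns four
    half-resolution sub-bands: the low-frequency average plus detail
    bands for top/bottom, left/right, and diagonal differences.
    """
    low, row_diff, col_diff, diag = [], [], [], []
    for i in range(0, len(img), 2):
        r_low, r_row, r_col, r_diag = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]          # top-left, top-right
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom-left, bottom-right
            r_low.append((a + b + c + d) / 2)   # smooth content
            r_row.append((a + b - c - d) / 2)   # top minus bottom
            r_col.append((a - b + c - d) / 2)   # left minus right
            r_diag.append((a - b - c + d) / 2)  # diagonal detail
        low.append(r_low)
        row_diff.append(r_row)
        col_diff.append(r_col)
        diag.append(r_diag)
    return low, row_diff, col_diff, diag

# A flat 2x2 block next to a block containing a vertical edge.
img = [[4, 4, 1, 0],
       [4, 4, 1, 0]]
low, row_diff, col_diff, diag = haar2d_level(img)
print(low, row_diff, col_diff, diag)
```

The flat block contributes only to the low-frequency band, while the block with the edge also produces a non-zero left/right detail coefficient. Subtle local structure of this kind is what frequency-domain features expose.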

44. 【2603.06311】Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

Link: https://arxiv.org/abs/2603.06311

Authors: Eitan Shaar, Ariel Shaulov, Yalcin Tur, Gal Chechik, Ravid Shwartz-Ziv

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern vision models, Stable Diffusion VAE, vision models, methods optimize perturbations, optimize perturbations directly

Abstract: Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose LTA (Latent Transfer Attack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.
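
One way to read "softly enforcing a pixel-space budget" is as a penalty term rather than a hard clip. The sketch below is an assumption about the general idea, not the paper's loss: the squared-hinge form, the function name `linf_excess_penalty`, and the toy pixel values are all hypothetical.

```python
def linf_excess_penalty(clean, adv, eps):
    """Mean squared amount by which each pixel change exceeds the budget eps.

    Pixels whose change stays within eps contribute nothing, so the penalty
    is zero whenever the l-infinity constraint is already satisfied.
    """
    total = 0.0
    for x, y in zip(clean, adv):
        excess = max(0.0, abs(y - x) - eps)
        total += excess ** 2
    return total / len(clean)

clean = [0.50, 0.50, 0.50]       # flattened toy "image"
inside = [0.50, 0.53, 0.46]      # every change is below eps
outside = [0.50, 0.53, 0.60]     # last pixel exceeds eps by about 0.05
eps = 0.05

print(linf_excess_penalty(clean, inside, eps))
print(linf_excess_penalty(clean, outside, eps))
```

In an attack loop such a term would typically be added, with some weight, to the negated classifier loss, so that the optimizer is discouraged from leaving the budget without being hard-clipped at every step.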

45. 【2603.06302】DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Link: https://arxiv.org/abs/2603.06302

Authors: Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: increasingly sophisticated, sophisticated and widely, understand their decision-making, Vision-Language Models, generation process

Comments: Project page: https://walidbousselham.com/DEX-AR

Abstract: As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes increasingly important to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.

46. 【2603.06300】3D CBCT Artefact Removal Using Perpendicular Score-Based Diffusion Models

Link: https://arxiv.org/abs/2603.06300

Authors: Susanne Schaub, Florentin Bieder, Matheus L. Oliveira, Yulan Wang, Dorothea Dagassan-Berndt, Michael M. Bornstein, Philippe C. Cattin

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Cone-beam computed tomography, minimising radiation exposure, Cone-beam computed, offering high-resolution images, computed tomography

Comments: Accepted at DGM4MICCAI 2025

Abstract:Cone-beam computed tomography (CBCT) is a widely used 3D imaging technique in dentistry, offering high-resolution images while minimising radiation exposure for patients. However, CBCT is highly susceptible to artefacts arising from high-density objects such as dental implants, which can compromise image quality and diagnostic accuracy. To reduce artefacts, implant inpainting in the sequence of projections plays a crucial role in many artefact reduction approaches. Recently, diffusion models have achieved state-of-the-art results in image generation and have widely been applied to image inpainting tasks. However, to our knowledge, existing diffusion-based methods for implant inpainting operate on independent 2D projections. This approach neglects the correlations among individual projections, resulting in inconsistencies in the reconstructed images. To address this, we propose a 3D dental implant inpainting approach based on perpendicular score-based diffusion models, each trained in two different planes and operating in the projection domain. The 3D distribution of the projection series is modelled by combining the two 2D score-based diffusion models in the sampling scheme. Our results demonstrate the method's effectiveness in producing high-quality, artefact-reduced 3D CBCT images, making it a promising solution for improving clinical imaging.

47. 【2603.06289】FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

Link: https://arxiv.org/abs/2603.06289

Authors: Zhen Wang, Youcan Xu, Jun Xiao, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rendering new scenes, aims to generate, generate a target, motion transfer aims, inherits motion patterns

Abstract:Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.

48. 【2603.06281】Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning

Link: https://arxiv.org/abs/2603.06281

Authors: Haojie Pu, Zhuoming Li, Yongbiao Gao, Yuheng Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative zero-shot learning, leveraging semantic conditions, Attribute Distribution Modeling, Attribute Distribution, zero-shot learning

Comments: 17 pages, 13 figures

Abstract: Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.

49. 【2603.06279】Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise

Link: https://arxiv.org/abs/2603.06279

Authors: Wenxin Li, Kunyu Peng, Di Wen, Junwei Zheng, Jiale Wei, Mengfei Duan, Yuheng Zhang, Rui Fan, Kailun Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: real-world voxel annotations, dynamic trailing effects, annotations are inherently, inherently corrupted, noise

Comments: The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL

Abstract: 3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.

50. 【2603.06275】Spectral and Trajectory Regularization for Diffusion Transformer Super-Resolution

Link: https://arxiv.org/abs/2603.06275

Authors: Jingkai Wang, Yixin Tang, Jue Gong, Jiatong Li, Shu Li, Libo Liu, Jianliang Lan, Yutong Liu, Yulun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world image super-resolution, Diffusion transformer, show great potential, image super-resolution, architectures show great

Comments: 14 pages

Abstract: Diffusion transformer (DiT) architectures show great potential for real-world image super-resolution (Real-ISR). However, their computationally expensive iterative sampling necessitates one-step distillation. Existing one-step distillation methods struggle with Real-ISR on DiT: they suffer from a fundamental trajectory mismatch and generate severe grid-like periodic artifacts. To tackle these challenges, we propose StrSR, a novel one-step adversarial distillation framework featuring spectral and trajectory regularization. Specifically, we propose an asymmetric discriminative distillation architecture to bridge the trajectory gap. Additionally, we design a frequency distribution matching strategy to effectively suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage. Extensive experiments demonstrate that StrSR achieves state-of-the-art performance in Real-ISR, across both quantitative metrics and visual perception. The code and models will be released publicly.

51. 【2603.06270】HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

Link: https://arxiv.org/abs/2603.06270

Authors: Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Pruning vision-language models, amplifying object hallucinations, vision-language models, efficient deployment, deployment is challenging

Abstract:Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness--utility trade-offs under matched sparsity budgets.
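
The group-relative part of GRPO is standard: each candidate sampled in a group gets an advantage equal to its return minus the group mean, divided by the group standard deviation. The sketch below applies that to pruning plans. The weighted-sum scalarization, the three objectives, the preference weights, and all numbers are assumptions made for illustration; the abstract does not specify how the multi-objective return is combined.

```python
from statistics import mean, pstdev

def scalarize(objectives, preference):
    """Combine per-objective scores into one return (assumed: weighted sum)."""
    return sum(w * o for w, o in zip(preference, objectives))

def group_relative_advantages(returns, eps=1e-8):
    """Standardize returns within one sampled group, as in GRPO."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

# Hypothetical (utility, hallucination robustness, compression) per plan.
plans = [
    (0.8, 0.7, 0.3),
    (0.6, 0.9, 0.5),
    (0.4, 0.5, 0.9),
    (0.7, 0.8, 0.4),
]
preference = (0.6, 0.3, 0.1)  # hypothetical user preference vector

returns = [scalarize(p, preference) for p in plans]
advantages = group_relative_advantages(returns)
for plan, adv in zip(plans, advantages):
    print(plan, round(adv, 3))
```

Changing `preference` changes which plan receives the highest advantage, which is one simple way a preference vector could steer the trade-off between objectives.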

52. 【2603.06265】ODD-SEC: Onboard Drone Detection with a Spinning Event Camera

Link: https://arxiv.org/abs/2603.06265

Authors: Kuan Dai, Hongxin Zhang, Sheng Zhong, Yi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires balancing innovation, drones requires balancing, innovation with regulation

Abstract: The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. [...] solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse [...]. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based [...]. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers, such as quadrupedal robots or unmanned ground vehicles, during field [...]. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
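
With a 360° field of view, bearing errors have to account for wraparound: a prediction of 359° against a true bearing of 1° is off by 2°, not 358°. The sketch below shows the usual way to compute such an error. It is an assumption about how a mean angular error could be evaluated, not the paper's evaluation code, and the bearings are made up.

```python
def angular_error_deg(pred, true):
    """Smallest absolute difference between two bearings, in degrees."""
    diff = abs(pred - true) % 360.0
    return min(diff, 360.0 - diff)

def mean_angular_error(preds, trues):
    """Average wraparound-aware bearing error over paired samples."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)

# Hypothetical predicted and ground-truth bearings of detected drones.
preds = [359.0, 90.0, 181.5]
trues = [1.0, 91.0, 180.0]
print(mean_angular_error(preds, trues))
```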

53. 【2603.06256】GazeMoE: Perception of Gaze Target with Mixture-of-Experts

Link: https://arxiv.org/abs/2603.06256

Authors: Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Estimating human gaze, understand human attention, generalizable neural architectures, training paradigms remains, Estimating human

Comments: 8 pages, 3 figures, ICRA 2026

Abstract: Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues (including eyes, head poses, gestures, and contextual features) demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models have been released.

54. 【2603.06254】NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving

Link: https://arxiv.org/abs/2603.06254

Authors: Kai Luo, Xu Wang, Rui Fan, Kailun Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: pipelines remain limited, Generalizing across unknown, semantic-blind heuristics, open-world perception, pipelines remain

Comments: Code will be available at https://github.com/xifen523/NOVA

Abstract: Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.

55. 【2603.06250】Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation

Link: https://arxiv.org/abs/2603.06250

Authors: Keshen Zhou, Runnan Chen, Mingming Gong, Tongliang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Referring Expression Segmentation, Referring Expression, Expression Segmentation, descriptions match multiple, scenes based

Abstract: Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities (pixel-level and instance-level features), preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.

56. 【2603.06242】DC-Merge: Improving Model Merging with Directional Consistency

Link: https://arxiv.org/abs/2603.06242

Authors: Han-Chen Zhang, Zi-Hao Zhou, Mao-Lin Luo, Shimin Di, Min-Ling Zhang, Tong Wei

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate multiple task-adapted, multiple task-adapted models, Model merging aims, task vectors, aims to integrate

Comments: Accepted by CVPR 2026 Main Track

Click to view abstract

Abstract:Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi-task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings. The implementation code is available at this https URL.

57. 【2603.06231】TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving

Link: https://arxiv.org/abs/2603.06231

Authors: Mingyu Fan, Yi Liu, Hao Zhou, Deheng Qian, Mohammad Haziq Khan, Matthias Raetsch

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: support safe planning, autonomous driving, safe planning, essential for autonomous, vehicles to anticipate

Comments:

Click to view abstract

Abstract:Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at this https URL.
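
The abstract mentions a cosine-annealed distillation weighting scheme but does not give its endpoints or direction. The sketch below is only an illustration under assumed choices: the weight decays from `w_start` to `w_end` over training, and the total loss is the forecasting loss plus the weighted distillation loss. The function names and default values are hypothetical and are not taken from the paper.

```python
import math


def cosine_annealed_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine interpolation from w_start (at step 0) to w_end (at total_steps).

    The endpoints and the decaying direction are assumptions for illustration.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(forecast_loss, distill_loss, step, total_steps):
    """Combine forecasting supervision with a feature-alignment (distillation) term."""
    weight = cosine_annealed_weight(step, total_steps)
    return forecast_loss + weight * distill_loss
```

With these assumed defaults, the distillation term dominates early in training and fades out smoothly, so late training is driven mostly by the forecasting loss.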

58. 【2603.06228】Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention

Link: https://arxiv.org/abs/2603.06228

Authors: Haiqing Hao, Zhipeng Sui, Rong Zou, Zijia Dai, Nikola Zubić, Davide Scaramuzza, Wenhui Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high temporal resolution, cameras provide sequential, provide sequential visual, sequential visual data, Event cameras provide

Comments:

Click to view abstract

Abstract:Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
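
The claim that linear attention supports both parallel training and recurrent inference can be checked on a toy case. The sketch below uses plain, unnormalized causal linear attention with an identity feature map. It is generic linear attention, not the SSLA method from the paper, and all function names are illustrative.

```python
def linear_attention_recurrent(queries, keys, values):
    """Recurrent form: keep one global state S and update it per token (event)."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # S accumulates k_t v_t^T
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # o_t = q_t^T S_t
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def linear_attention_parallel(queries, keys, values):
    """Masked form: o_t = sum over s <= t of (q_t . k_s) v_s.

    Each position t is computed independently of the others, which is what
    allows the whole sequence to be processed in parallel during training.
    """
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * d_v
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(q, keys[s]))
            for j in range(d_v):
                out[j] += score * values[s][j]
        outputs.append(out)
    return outputs
```

Both functions return the same outputs. The recurrent form touches the full state for every token, which is the per-event cost the abstract identifies as the bottleneck of standard linear attention.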

59. 【2603.06220】Word-Anchored Temporal Forgery Localization

Link: https://arxiv.org/abs/2603.06220

Authors: Tianyi Wang, Xi Shao, Harry Cheng, Yinglong Wang, Mohan Kankanhalli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current temporal forgery, candidate forgery proposals, continuous frame-level anomaly, derive candidate forgery, approaches typically rely

Comments: Submitted for review

Click to view abstract

Abstract:Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradigm that shifts the TFL task from temporal regression and continuous localization to discrete word-level binary classification. Specifically, we first analyze the essence of temporal forgeries and identify the minimum meaningful forgery units, word tokens, and then align data preprocessing with the natural linguistic boundaries of speech. To adapt powerful pre-trained foundation backbones for feature extraction, we introduce the forensic feature realignment (FFR) module, mapping representations from the pre-trained semantic space to a discriminative forensic manifold. This allows subsequent lightweight linear classifiers to efficiently perform binary classification and accomplish the TFL task. Furthermore, to overcome the extreme class imbalance inherent to forgery detection, we design the artifact-centric asymmetric (ACA) loss, which breaks the standard precision-recall trade-off by dynamically suppressing overwhelming authentic gradients while asymmetrically prioritizing subtle forensic artifacts. Extensive experiments demonstrate that WAFL significantly outperforms state-of-the-art approaches in localization performance under both in- and cross-dataset settings, while requiring substantially fewer learnable parameters and operating at high computational efficiency.

60. 【2603.06216】EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.06216

Authors: Miriam Jäger, Boris Jutzi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Eigenentropy-optimized neighborhood densification, Eigenentropy-optimized neighborhood, Gaussian Splatting, neighborhood densification strategy, high-quality rendered

Comments: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing on 20 February 2026

Click to view abstract

Abstract:We present a novel Eigenentropy-optimized neighborhood densification strategy EntON in 3D Gaussian Splatting (3DGS) for geometrically accurate and high-quality rendered 3D reconstruction. While standard 3DGS produces Gaussians whose centers and surfaces are poorly aligned with the underlying object geometry, surface-focused reconstruction methods frequently sacrifice photometric accuracy. In contrast to the conventional densification strategy, which relies on the magnitude of the view-space position gradient, our approach introduces a geometry-aware strategy to guide adaptive splitting and pruning. Specifically, we compute the 3D shape feature Eigenentropy from the eigenvalues of the covariance matrix in the k-nearest neighborhood of each Gaussian center, which quantifies the local structural order. These Eigenentropy values are integrated into an alternating optimization framework: during the optimization process, the algorithm alternates between (i) standard gradient-based densification, which refines regions via view-space gradients, and (ii) Eigenentropy-aware densification, which preferentially densifies Gaussians in low-Eigenentropy (ordered, flat) neighborhoods to better capture fine geometric details on the object surface, and prunes those in high-Eigenentropy (disordered, spherical) regions. We provide quantitative and qualitative evaluations on two benchmark datasets: the small-scale DTU dataset and the large-scale TUM2TWIN dataset, covering man-made objects and urban scenes. Experiments demonstrate that our Eigenentropy-aware alternating densification strategy improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%. Overall, EntON achieves a favorable balance between geometric accuracy, rendering quality and efficiency by avoiding unnecessary scene expansion.
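
The abstract does not spell out the Eigenentropy formula. A commonly used definition in point cloud analysis, assumed here, is the Shannon entropy of the covariance eigenvalues after normalizing them to sum to one. The sketch below computes it for one k-nearest neighborhood in pure Python. The helper names are hypothetical, the neighbor search is brute force, and none of it should be read as the authors' implementation.

```python
import math


def covariance_3d(points):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def symmetric_eigenvalues_3x3(a):
    """Closed-form (trigonometric) eigenvalues of a symmetric 3x3 matrix."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return [a[0][0], a[1][1], a[2][2]]
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def eigenentropy(points):
    """Entropy of normalized covariance eigenvalues (assumed definition)."""
    eigs = [max(e, 0.0) for e in symmetric_eigenvalues_3x3(covariance_3d(points))]
    total = sum(eigs)
    if total == 0.0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in eigs if e > 0.0)


def k_nearest(center, points, k):
    """Brute-force k-nearest neighbors of a center by Euclidean distance."""
    return sorted(points, key=lambda p: math.dist(center, p))[:k]
```

Under this definition a neighborhood spread evenly over a plane scores ln 2, while an isotropic, sphere-like neighborhood scores the maximum ln 3, which matches the abstract's reading of low values as ordered and flat and high values as disordered and spherical.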

61. 【2603.06213】Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Link: https://arxiv.org/abs/2603.06213

Authors: Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Summarization, concise textual summaries, generate concise textual, aims to generate, generate concise

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at this https URL.

62. 【2603.06210】VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction

Link: https://arxiv.org/abs/2603.06210

Authors: Xiaoyang Yan, Muleilan Pei, Shaojie Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: crucial perception task, comprehensive scene understanding, Vision Foundation Models, autonomous driving, Grounded Gaussian Splatting

Comments:

Click to view abstract

Abstract:3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.

63. 【2603.06201】Point-Supervised Skeleton-Based Human Action Segmentation

Link: https://arxiv.org/abs/2603.06201

Authors: Hongsong Wang, Yiqin Shen, Pengbo Yan, Jie Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling intelligent systems, Skeleton-based temporal action, temporal action segmentation, challenging task, playing a crucial

Comments:

Click to view abstract

Abstract:Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.

64. 【2603.06200】Adaptive Language-Aware Image Reflection Removal Network

Link: https://arxiv.org/abs/2603.06200

Authors: Siyan Fang, Yuntao Wang, Jinpu Zhang, Ziwen Li, Yuehuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing image reflection, Existing image, image reflection removal, complex reflections, handle complex reflections

Comments: IJCAI 2025

Click to view abstract

Abstract:Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model's performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at this https URL.

65. 【2603.06186】SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

Link: https://arxiv.org/abs/2603.06186

Authors: Shuailin Xue, Jun Wan, Lihua Zhang, Wenwen Min

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers crucial insights, CTR detection, treatment response, tumor microenvironment, microenvironment and offers

Comments: Accepted by AAAI-2026-Oral

Click to view abstract

Abstract:Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data for CTR detection, especially in cross-sample and cross-platform/batch settings. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.

66. 【2603.06183】CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Link: https://arxiv.org/abs/2603.06183

Authors: Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: chest X-ray report, chest X-ray, X-ray report generation, contextual relevance, X-ray report

Comments:

Click to view abstract

67. 【2603.06181】Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

Link: https://arxiv.org/abs/2603.06181

Authors: Mingzhe Li, Mengyin Liu, Zekai Wu, Xincheng Lin, Junsheng Zhang, Ming Yan, Zengye Xie, Changwang Zhang, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Turing Test, Motion Turing Test, achieved significant progress, generation and control, natural and human-like

Comments: 13 pages, 10 figures, conference

Click to view abstract

Abstract:Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.

68. 【2603.06180】Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

Link: https://arxiv.org/abs/2603.06180

Authors: Claire Roman, Philippe Meyer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Learning similarity metrics, scripts remain uncertain, fundamental challenge, uncertain and contested, similarity metrics

Comments:

Click to view abstract

69. 【2603.06178】Making Training-Free Diffusion Segmentors Scale with the Generative Power

Link: https://arxiv.org/abs/2603.06178

Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: discriminative tasks, recently been explored, explored for discriminative, diffusion models, powerful diffusion models

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to so-called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at this https URL.

70. 【2603.06173】Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning

Link: https://arxiv.org/abs/2603.06173

Authors: Yueying Tian, Xudong Han, Meng Zhou, Rodrigo Aviles-Espinosa, Rupert Young, Philip Birch

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: standard training objectives, clinical relevance remains, medical image generation, Diffusion models, medical image

Comments: Preprint

Click to view abstract

Abstract:Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.

71. 【2603.06168】JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

Link: https://arxiv.org/abs/2603.06168

Authors: Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: panoramic images remains, point cloud data, challenging task, primarily due, fixed-label models

Comments:

Click to view abstract

Abstract:Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.

72. 【2603.06167】A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement

Link: https://arxiv.org/abs/2603.06167

Authors: Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge, Yuwen Zeng, Xiaoyong Zhang, Eichi Takaya, Jan Vrba, Noriyasu Homma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extremely limited annotations, leading to inaccurate, BUS images remains, suffers from unstable, unstable pseudo labels

Comments:

Click to view abstract

Abstract:Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
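
The exponential moving average (EMA) teacher mentioned in the abstract is a standard semi-supervised device: the teacher's weights track a smoothed copy of the student's. The sketch below shows the usual update rule on a plain dictionary of scalar weights. The decay value and the function name are assumptions for illustration, not details from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to floats; the decay of 0.99
    is an assumed value, not one reported in the abstract.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}
```

Calling `ema_update` once per training step makes the teacher change slowly, which is the usual reason EMA teachers give steadier pseudo labels than the raw student.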

73. 【2603.06166】FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models

Link: https://arxiv.org/abs/2603.06166

Authors: Andrew Caunes, Thierry Chateau, Vincent Fremont

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ego vehicle surroundings, road scene analysis, vehicle surroundings, ego vehicle, panoptic occupancy

Comments: 14 pages

Click to view abstract

Abstract:Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.

74. 【2603.06165】Reflective Flow Sampling Enhancement

Link: https://arxiv.org/abs/2603.06165

Authors: Zikai Zhou,Muyao Wang,Shitong Shao,Lichen Bai,Haoyi Xiong,Bo Han,Zeke Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generative modeling, diffusion models, conventional diffusion models, growing demand, led to rapid

Comments:

Click to view abstract

Abstract:The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.

75. 【2603.06148】VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

Link: https://arxiv.org/abs/2603.06148

Authors: Rohit Saxena,Alessandro Suglia,Pasquale Minervini

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real-world image distortions, high-quality datasets, Vision-language models, fully understand, perform under real-world

Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.

76. 【2603.06147】Longitudinal NSCLC Treatment Progression via Multimodal Generative Models

Link: https://arxiv.org/abs/2603.06147

Authors: Massimiliano Mantegna,Elena Mulero Ayllón,Alice Natalina Caragliano,Francesco Di Feola,Claudia Tacconi,Michele Fiore,Edy Ippolito,Carlo Greco,Sara Ramella,Philippe C. Cattin,Paolo Soda,Matteo Tortora,Valerio Guarrasi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinically critical challenge, Predicting tumor evolution, critical challenge, Predicting tumor, clinically critical

Comments:

Click to view abstract

Abstract:Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.

77. 【2603.06141】Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models

Link: https://arxiv.org/abs/2603.06141

Authors: Nicoleta-Nina Basoc,Adrian Cosma,Emilian Radoi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve strong benchmark, strong benchmark results, systematic perceptual weaknesses, exhibit systematic perceptual, underlying scene remains

Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.

78. 【2603.06140】Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Link: https://arxiv.org/abs/2603.06140

Authors: Bohai Gu,Taiyi Wu,Dazhao Du,Jian Liu,Shuai Yang,Xiaotong Zhao,Alan Zhao,Song Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved high visual, Modern video editing, Modern video, Multimodal Large Language, high visual fidelity

Comments: [this https URL](https://nevsnev.github.io/Place-it-R1/)

Click to view abstract

Abstract:Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate Place-it-R1 achieves physically-coherent video object insertion compared with state-of-the-art solutions and commercial models.

79. 【2603.06136】Cross-Resolution Distribution Matching for Diffusion Distillation

Link: https://arxiv.org/abs/2603.06136

Authors: Feiyang Chen,Hongpeng Pan,Haonan Xu,Xinyu Duan,Yang Yang,Zhefeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion distillation, denoising process, largely saturated, image and video, existing methods

Comments:

Click to view abstract

Abstract:Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher's high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.
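
As a rough illustration of the logSNR quantities this abstract refers to, the sketch below assumes a linear interpolant x_t = (1 - t) * x0 + t * noise, for which logSNR(t) = 2 log((1 - t) / t). The abstract does not specify RMD's actual schedule, partition rule, or mapping, so `shift_timestep` and `split_by_logsnr` are hypothetical helpers that only show what a logSNR-matched timestep mapping and a logSNR-threshold split could look like.

```python
import math


def logsnr(t):
    """logSNR of x_t = (1 - t) * x0 + t * noise, valid for 0 < t < 1."""
    return 2.0 * math.log((1.0 - t) / t)


def timestep_from_logsnr(value):
    """Inverse of `logsnr`: the t in (0, 1) whose logSNR equals `value`."""
    return 1.0 / (1.0 + math.exp(value / 2.0))


def shift_timestep(t, shift):
    """Timestep whose logSNR equals logsnr(t) + shift (hypothetical mapping)."""
    return timestep_from_logsnr(logsnr(t) + shift)


def split_by_logsnr(timesteps, threshold):
    """Assumed rule: noisy steps (logSNR < threshold) go to the low-resolution
    stage and the remaining steps go to the high-resolution stage."""
    low = [t for t in timesteps if logsnr(t) < threshold]
    high = [t for t in timesteps if logsnr(t) >= threshold]
    return low, high
```

For example, under this interpolant a shift of 2 log 2 moves t = 0.5 (logSNR 0) to t = 1/3, since (1 - t) / t must then equal 2.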

80. 【2603.06122】FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

Link: https://arxiv.org/abs/2603.06122

Authors: Xin Xu,Binchang Ma,Zhixi Yu,Wei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: client data privacy, protecting client data, model generalization ability, person re-identification, aims to enhance

Comments:

Click to view abstract

Abstract:The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).

81. 【2603.06090】DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Link: https://arxiv.org/abs/2603.06090

Authors: Hao Yang,Hongbo Zhang,Yanyan Zhao,Bing Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal large language, large language models, accurately interpret depth, achieved impressive performance, depth

Comments:

Click to view abstract

82. 【2603.06081】Lyapunov Probes for Hallucination Detection in Large Foundation Models

Link: https://arxiv.org/abs/2603.06081

Authors: Bozhi Luan,Gen Li,Yalan Qin,Jifeng Guo,Yun Zhou,Faguo Wu,Hongwei Zheng,Wenjun Wu,Zhaoxin Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, Language Models

Comments:

Click to view abstract

Abstract:We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.

83. 【2603.06071】Text-Driven Emotionally Continuous Talking Face Generation

Link: https://arxiv.org/abs/2603.06071

Authors: Hao Yang,Yanyan Zhao,Tian Zheng,Hongbo Zhang,Bichen Wang,Di Wu,Xing Fu,Xuda Zhi,Yongbo Huang,Hao He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Talking Face Generation, expressive digital faces, Face Generation, emotionally expressive digital, Talking Face

Comments:

Click to view abstract

Abstract:Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method's exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.

84. 【2603.06061】Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.06061

Authors: Semin Bae,Hansol Lim,Jongseong Brad Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: autonomous driving, Gaussian Splatting, demand for large-scale, rapidly growing, growing in robotics

Comments: This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication

Click to view abstract

Abstract:The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of these sensor data is directly discarded or strictly underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH) based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.

85. 【2603.06057】TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Link: https://arxiv.org/abs/2603.06057

Authors: Soumya Mazumdar,Vineet Kumar Rakesh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: challenging speech conditions, recently advanced photorealistic, advanced photorealistic human, imperfect audio-visual alignment, photorealistic human synthesis

Comments:

Click to view abstract

Abstract:Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: this https URL

86. 【2603.06054】Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Link: https://arxiv.org/abs/2603.06054

Authors: Nikos Theodoridis,Reenu Mohandas,Ganesh Sistu,Anthony Scanlan,Ciarán Eising,Tim Brophy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: long tail scenarios, handle long tail, automated driving applications, automated driving, visual

Comments:

Click to view abstract

Abstract:The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
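
The linear-probing idea described in this abstract can be illustrated with a small, self-contained sketch. It is not the authors' code: the "activations" are synthetic 2-D vectors standing in for real VLM activations, and `make_synthetic_activations`, `train_linear_probe`, and `probe_accuracy` are hypothetical names. A logistic-regression probe is fitted to separate two counterfactual sets; high probe accuracy would suggest that the concept is linearly encoded.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n_per_class=100, seed=0):
    """Toy stand-in for model activations: two Gaussian clusters in 2-D."""
    rng = random.Random(seed)
    feats, labels = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            feats.append([rng.gauss(center, 0.5), rng.gauss(0.0, 0.5)])
            labels.append(label)
    return feats, labels


def train_linear_probe(feats, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(feats[0])
    weights, bias, n = [0.0] * dim, 0.0, len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, feats, labels):
    """Fraction of samples on the correct side of the probe's hyperplane."""
    hits = 0
    for x, y in zip(feats, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)
```

In an actual probing study the accuracy would be measured on held-out counterfactual pairs rather than on the training set, and the features would come from a chosen layer of the model under study.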

87. 【2603.06049】Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

Link: https://arxiv.org/abs/2603.06049

Authors: Canyu Chen,Yuguang Yang,Zhewen Tan,Yizhi Wang,Ruiyi Zhan,Haiyan Liu,Xuanyao Mao,Jason Bao,Xinyue Tang,Linlin Yang,Bingchuan Sun,Yan Wang,Baochang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: subsequent Reinforcement Learning, fundamental Narrow Policy, Narrow Policy limitation, driving Imitation Learning, Policy limitation undermining

Comments: Accepted by CVPR2026 findings

Click to view abstract

Abstract:We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. Thereby, we propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal style weighting to amplify reward's value span for improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: this https URL.

88. 【2603.06048】GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Link: https://arxiv.org/abs/2603.06048

Authors: Xuan Huang,Mochu Xiang,Zhelun Shen,Jinbo Wu,Chenming Wu,Chen Zhao,Kaisiyuan Wang,Hang Zhou,Shanshan Liu,Haocheng Feng,Wei He,Jingdong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate physically plausible, physically plausible contact, human video synthesis, digital human video, preserve object identity

Comments:

Click to view abstract

Abstract:Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: this https URL

89. 【2603.06043】Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

Link: https://arxiv.org/abs/2603.06043

Authors: Jiadong Pan,Liang Li,Yuxin Peng,Yu-Ming Tang,Shuohuan Wang,Yu Sun,Hua Wu,Qingming Huang,Haifeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unified multimodal models, demonstrating strong potential, made remarkable progress, unified multimodal, multimodal models

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.

90. 【2603.06038】FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Link: https://arxiv.org/abs/2603.06038

Authors: Xia Xin,Yuki Endo,Yoshihiro Kanamori

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: typography remains challenging, generate high-quality images, controlling typography remains, requested typographic appearance, remains challenging

Comments:

Click to view abstract

Abstract:Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at this https URL.

91. 【2603.06036】Ensemble Learning with Sparse Hypercolumns

Link: https://arxiv.org/abs/2603.06036

Authors: Julia Dietlmeier,Vayangi Ganepola,Oluwabukola G. Adegboro,Mayug Maniparambil,Claudia Mazo,Noel E. O'Connor

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: feature vectors built, convolutional neural networks, image pixel location, Directly inspired, single image pixel

Comments: presented at 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025)

Click to view abstract

Abstract:Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16 based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value = $3.07 \times 10^{-11}$, Wilcoxon signed-rank test), which is less effective due to overfitting.
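
The two generic ingredients named in this abstract, stratified subsampling of pixels and the Dice score, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' pipeline: `stratified_subsample` and `dice_score` are hypothetical helper names, labels are flat per-pixel lists, and keeping at least one pixel per class is an assumption added so that small classes never vanish.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, rate, seed=0):
    """Return sorted pixel indices keeping roughly `rate` of every class.

    Assumption: at least one pixel is kept per class so that rare classes
    (for example, small tumor regions) are never dropped entirely.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * rate))
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for flat binary masks of 0s and 1s."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total
```

With 90 background pixels and 10 foreground pixels, a 10% rate keeps 9 background and 1 foreground index, so the class ratio of the sparse set matches that of the dense one.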

92. 【2603.06034】Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking

Link: https://arxiv.org/abs/2603.06034

Authors: Chunjiang Li,Jianbo Ma,Li Shen,Yanru Chen,Liangyin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: involves analyzing object, Multi-object tracking, analyzing object trajectories, involves analyzing, video sequences

Comments: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR2026)

Click to view abstract

Abstract:Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into the four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.

93. 【2603.06032】StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

Link: https://arxiv.org/abs/2603.06032

Authors: Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex prompts accurately, interpret complex prompts, generation requires models, reasoning, prompts accurately

Comments:

Abstract:Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively "perceive" visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.

94. 【2603.06024】ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Link: https://arxiv.org/abs/2603.06024

Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: current vision-language models, Multi-view spatial reasoning, reasoning remains difficult, Multi-view spatial, remains difficult

Comments:

95. 【2603.06022】MOSIV: Multi-Object System Identification from Videos

Link: https://arxiv.org/abs/2603.06022

Authors: Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: discrete material classification, multi-object system identification, identification from videos, introduce the challenging, challenging problem

Comments: ICLR 2026

Abstract:We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.

96. 【2603.06014】EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation

Link: https://arxiv.org/abs/2603.06014

Authors: Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, Jing Liao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: costly production pipelines, typically requires expert, requires expert knowledge, production pipelines, essential for enhancing

Comments: Project page: [this https URL](https://effectmaker.github.io)

Abstract:Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: this https URL

97. 【2603.06002】Demystifying KAN for Vision Tasks: The RepKAN Approach

Link: https://arxiv.org/abs/2603.06002

Authors: Minjong Cheon

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Remote sensing image, sensing image classification, Earth observation, Remote sensing, essential for Earth

Comments:

Abstract:Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black-boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design -- Spatial Linear and Spectral Non-linear -- RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.

98. 【2603.06001】Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Link: https://arxiv.org/abs/2603.06001

Authors: Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: models enable robots, enable robots, robots to perform, increasingly viewed, foundation for generalist

Comments:

Abstract:Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.

99. 【2603.05999】RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

Link: https://arxiv.org/abs/2603.05999

Authors: Cheng Guan, Chunyu Lin, Zhijie Shen, Junsong Zhang, Jiyuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent depth foundation, substantial geometric discrepancy, Recent depth, achieve strong performance, depth foundation models

Comments:

Abstract:Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1% of the training data. Under the same in-domain training setting, it further achieves an approximately 20% improvement in RMSE. Code will be released upon acceptance.

100. 【2603.05997】MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs

Link: https://arxiv.org/abs/2603.05997

Authors: Zhi Lei, Chenxi Liu, Hao Miao, Wanghui Qiu, Bin Yang, Chenjuan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Irregularly sampled time, sampled time series, uneven time intervals, Irregularly sampled, exhibiting asynchronous observations

Comments:

Abstract:Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations on uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we achieve MM-ISTS, a multimodal framework augmented by vision-text large language models, that bridges temporal, visual, and textual modalities, facilitating ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, filtering out small-scale useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.

101. 【2603.05987】Technical Report: Automated Optical Inspection of Surgical Instruments

Link: https://arxiv.org/abs/2603.05987

Authors: Zunaira Shafqat, Atif Aftab Ahmed Jilani, Qurrat Ul Ain

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: maintaining the highest, clinical success, dynamic landscape, landscape of modern, surgical instruments

Comments: 20 pages, 33 figures, 6 tables. Technical Report

Abstract:In the dynamic landscape of modern healthcare, maintaining the highest standards in surgical instruments is critical for clinical success. This report explores the diverse realm of surgical instruments and their associated manufacturing defects, emphasizing their pivotal role in ensuring the safety of surgical procedures. With potentially fatal consequences arising from even minor defects, precision in manufacturing is this http URL report addresses the identification and rectification of critical defects such as cracks, rust, and structural irregularities. Such scrutiny prevents substantial financial losses for manufacturers and, more crucially, safeguards patient lives. The collaboration with industry leaders Daddy D Pro and Dr. Frigz International, renowned trailblazers in the Sialkot surgical cluster, provides invaluable insights into the analysis of defects in Pakistani-made instruments. This partnership signifies a commitment to advancing automated defect detection methodologies, specifically through the integration of deep learning architectures including YOLOv8, ResNet-152, and EfficientNet-b4, thereby elevating quality standards in the manufacturing process. The scope of this report is to identify various surgical instruments manufactured in Pakistan and analyze their associated defects using a newly developed dataset of 4,414 high-resolution images. By focusing on quality assurance through Automated Optical Inspection (AOI) tools, this document serves as a resource for manufacturers, healthcare professionals, and regulatory bodies. The insights gained contribute to the enhancement of instrument standards, ensuring a more reliable healthcare environment through industry expertise and cutting-edge technology.

102. 【2603.05982】HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild

Link: https://arxiv.org/abs/2603.05982

Authors: Ziyang Zhao, Shuheng Wang, Zhonghua Miao, Ya Xiong

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: tabletop strawberry harvesting, unstructured task challenged, greenhouse tabletop strawberry, real greenhouse tabletop, study on transferring

Comments:

Abstract:This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with full fine-tuning and LoRA. Under a unified 50 trials real-greenhouse protocol and metrics spanning completion, pi_0.5 with full fine-tuning achieved success rate of 74.0% with 32.6 s/pick and damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: this https URL.

103. 【2603.05971】Towards High-resolution and Disentangled Reference-based Sketch Colorization

Link: https://arxiv.org/abs/2603.05971

Authors: Dingkun Yan, Xinrui Wang, Ru Wang, Zhuoru Li, Jinze Yu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sketch colorization, distribution shift, digital illustrations, critical task, task for automating

Comments:

Abstract:Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributions from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing State-of-the-Art performance across both quality and controllability metrics. Ablation study reveals the influence of each component.

104. 【2603.05970】Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

Link: https://arxiv.org/abs/2603.05970

Authors: Jingtao Ye, Kexin Zhang, Xunchi Ma, Yuehan Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: induce significant observational, unmanned aerial vehicles, UAV-perspective MOT, significant observational challenges, induce significant

Comments:

Abstract:The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.

105. 【2603.05969】Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Link: https://arxiv.org/abs/2603.05969

Authors: Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visually similar images, captioning generates descriptions, Change captioning generates, generates descriptions, descriptions that explicitly

Comments: Accepted to ICLR 2026. Code and models are available at [this https URL](https://github.com/BlueberryOreo/ProCap)

106. 【2603.05965】PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

Link: https://arxiv.org/abs/2603.05965

Authors: Jinseop Lee, Byoungho Lee, Gichul Yoo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Occupancy BEV Encoding, PRobabilistic Occupancy BEV, BEV cell occupancy, Bernoulli random variable, BEV Encoding

Comments: 8 pages, 8 figures

Abstract:We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \times S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity allowing cross-sensor generalization without per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance against both handcrafted and supervised baselines. The source code and supplementary materials are available at this https URL.
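The relation $\sigma_\theta = \sigma_t / r$ quoted in this abstract can be made concrete with a few lines of Python. The sketch below only evaluates that formula at the centre of each range ring and converts the result to sector widths. The grid parameters (`num_rings`, `max_range`, `num_sectors`) and both function names are hypothetical and are not taken from the paper.

```python
import math


def angular_uncertainty(sigma_t, r):
    """sigma_theta = sigma_t / r in radians (sigma_t and r in meters, r > 0)."""
    if r <= 0:
        raise ValueError("r must be positive")
    return sigma_t / r


def per_ring_sigma_in_sectors(sigma_t, num_rings, max_range, num_sectors):
    """Angular uncertainty at each ring centre, expressed in sector widths."""
    ring_width = max_range / num_rings
    sector_width = 2.0 * math.pi / num_sectors
    sigmas = []
    for i in range(num_rings):
        r_centre = (i + 0.5) * ring_width
        sigmas.append(angular_uncertainty(sigma_t, r_centre) / sector_width)
    return sigmas


# Hypothetical polar grid: 4 rings out to 80 m, 60 angular sectors, sigma_t = 0.5 m.
sigmas = per_ring_sigma_in_sectors(
    sigma_t=0.5, num_rings=4, max_range=80.0, num_sectors=60
)
```

The same translational uncertainty therefore smears a nearby cell over a wider angle than a distant one, which is what "distance-adaptive" refers to.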

107. 【2603.05964】CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2603.05964

Authors: Jinyeong Park, Donghwa Kim, Brent ByungHoon Kang, Hyeongboo Baek, Jibum Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary object detection, sizes hinder deployment, Open-vocabulary object, massive model sizes, model sizes hinder

Comments:

Abstract:Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher's multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
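For readers unfamiliar with low-bit quantization, the sketch below shows a generic symmetric uniform quantize-dequantize step at 4 bits, the kind of operation that quantization-aware training typically simulates in the forward pass. It is a textbook illustration only: it does not implement CR-QAT's curriculum or its relational distillation, and the function name and toy weights are assumptions.

```python
def fake_quantize(weights, num_bits=4):
    """Symmetric uniform quantize-dequantize of a list of floats."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for signed 4-bit values
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clip to the integer grid
        out.append(q * scale)
    return out


# Toy weights: small values collapse onto a coarse grid at 4 bits.
weights = [0.7, -0.35, 0.1, 0.02]
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
```

With only 15 usable levels, the rounding error per weight can reach half a grid step, which gives some intuition for why naive 4-bit quantization can disturb fine-grained alignment.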

108. 【2603.05963】Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Link: https://arxiv.org/abs/2603.05963

Authors: Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated impressive capabilities, Recent advances, large-scale pretrained vision, pretrained vision models, downstream tasks

Comments: Submitted to IEEE TPAMI, under review

Abstract:Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.

109. 【2603.05962】Exploring Open-Vocabulary Object Recognition in Images using CLIP

Link: https://arxiv.org/abs/2603.05962

Authors: Wei Yu Chen, Ying Dai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: substantial training costs, specifically high system, high system complexity, streamlined two-stage strategy, existing open-vocabulary object

Comments:

Abstract:To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
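The final recognition step described here, matching an object embedding against category text embeddings by similarity, can be sketched as follows. The 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, cosine similarity is an assumed choice of similarity measure, and `recognize` is a hypothetical helper rather than the paper's code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(image_embedding, text_embeddings):
    """Return (label, score) of the category whose text embedding is closest."""
    scored = (
        (label, cosine(image_embedding, emb))
        for label, emb in text_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy vocabulary; with real embeddings any label set could be supplied here.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
label, score = recognize([0.1, 0.9, 0.2], text_embeddings)
```

Because the vocabulary is just a dictionary of text embeddings, new categories can be added without retraining, which is the sense in which the approach supports arbitrary vocabularies.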

110. 【2603.05959】OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Link: https://arxiv.org/abs/2603.05959

Authors: Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires continuous inference, video requires continuous, bounded resources, requires continuous, Reconstructing

Comments:

Abstract:Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
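The idea of a fixed-budget cache with protected tokens can be sketched with a small container class. This is a simplified reading of the abstract, not OVGGT's implementation: the `importance` value stands in for whatever score the method derives (the abstract mentions FFN residual magnitudes), and the class name, the tuple layout, and the fallback of dropping the oldest entry when everything is protected are all assumptions.

```python
class BoundedKVCache:
    """Keeps at most `budget` entries; anchors are evicted only as a last resort."""

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.entries = []  # each entry: (token_id, importance, is_anchor)

    def add(self, token_id, importance, is_anchor=False):
        self.entries.append((token_id, importance, is_anchor))
        while len(self.entries) > self.budget:
            evictable = [i for i, e in enumerate(self.entries) if not e[2]]
            if evictable:
                # Evict the least important non-anchor entry.
                victim = min(evictable, key=lambda i: self.entries[i][1])
            else:
                # Assumption: if every entry is an anchor, drop the oldest one.
                victim = 0
            del self.entries[victim]

    def token_ids(self):
        return [e[0] for e in self.entries]


cache = BoundedKVCache(budget=3)
cache.add("t0", 0.1, is_anchor=True)  # low score, but protected
cache.add("t1", 0.5)
cache.add("t2", 0.2)
cache.add("t3", 0.9)  # over budget: t2 has the lowest unprotected score
cache.add("t4", 0.3)  # over budget: t4 itself is now the lowest
```

The cache size never exceeds the budget regardless of how many tokens arrive, which is the property that makes memory use independent of sequence length.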

111. 【2603.05952】Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

Link: https://arxiv.org/abs/2603.05952

Authors: Hongli Liu, Yu Wang, Shengjie Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant attention, Few-shot segmentation, limited supervision, gained significant, significant attention

Comments: Accepted by CVPR Findings 2026

Abstract:Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at this https URL.

112. 【2603.05950】Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

链接：https://arxiv.org/abs/2603.05950

作者：Jialuo He,Huangxun Chen

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：accelerating Vision-Language Models, existing approaches rely, Vision-Language Models, fixed budget shared, image information density

备注：

点击查看摘要

Abstract:Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones, LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8ms per image.
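The budget rule described above can be sketched as follows. This is an illustration under stated assumptions, not the E-AdaPrune code: it assumes the singular values of the visual feature matrix are already available (for instance from a randomized SVD), defines spectral energy as the sum of squared singular values, and uses a hypothetical threshold `tau` plus hypothetical clamp bounds `min_tokens` and `max_tokens`.

```python
from itertools import accumulate


def energy_token_budget(singular_values, tau=0.9, min_tokens=1, max_tokens=None):
    """Return the smallest k whose top-k singular values hold >= tau of the energy.

    Assumptions (not from the paper): energy is the sum of squared singular
    values, and the budget is clamped to [min_tokens, max_tokens].
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    # Sort descending so the cumulative sum follows the dominant directions first.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return min_tokens
    if max_tokens is None:
        max_tokens = len(energies)
    k = len(energies)
    for i, cumulative in enumerate(accumulate(energies), start=1):
        if cumulative >= tau * total:
            k = i
            break
    return max(min_tokens, min(k, max_tokens))


# A flat spectrum (information-dense image) needs more tokens than a
# fast-decaying spectrum (redundant image) at the same energy threshold.
dense = [1.0] * 10
redundant = [10.0, 1.0, 0.5, 0.1]
print(energy_token_budget(dense, tau=0.85))      # 9
print(energy_token_budget(redundant, tau=0.85))  # 1
```

Because the budget is read off the spectrum, the rule adds no learnable parameters; only the threshold has to be chosen.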

113. 【2603.05947】LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

链接：https://arxiv.org/abs/2603.05947

作者：Song Fei,Tian Ye,Sixiang Chen,Zhaohu Xing,Jianyu Lai,Lei Zhu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：synthesize visually convincing, visually convincing details, critical failure mode, failure mode hard, severely degraded low-resolution

备注：

点击查看摘要

Abstract:Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
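The contrast between scalarizing before normalization and normalizing each objective before fusion can be shown with a small sketch. It is not LucidNFT's formulation: the z-score normalization, the fusion weights, and the two reward columns below are assumptions chosen only to show how a large-scale reward can drown out a small-scale one.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of values within one rollout group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def scalarize_then_normalize(rewards, weights):
    """Naive baseline: fuse rewards per rollout first, then normalize the sums."""
    fused = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return zscore(fused)


def normalize_then_fuse(rewards, weights):
    """Decoupled variant: normalize each objective across the group, then fuse."""
    per_objective = [zscore(list(column)) for column in zip(*rewards)]
    return [sum(w * a for w, a in zip(weights, row)) for row in zip(*per_objective)]


# One rollout group for a single LR input: rows are candidates, columns are
# hypothetical (perceptual quality, faithfulness) rewards on different scales.
group = [
    [80.0, 0.10],
    [60.0, 0.90],
    [70.0, 0.80],
]
weights = [0.5, 0.5]

naive = scalarize_then_normalize(group, weights)
decoupled = normalize_then_fuse(group, weights)
print(naive.index(max(naive)))          # 0: the large-scale reward dominates
print(decoupled.index(max(decoupled)))  # 2: the balanced candidate wins
```

In the naive variant the ranking follows the first reward almost exclusively, because its numeric range is far larger. Normalizing per objective puts both rewards on the same footing before they are combined.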

114. 【2603.05942】Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation

链接：https://arxiv.org/abs/2603.05942

作者：Luan Pham,Phu Hao Hoang,Xuan Toan Mai,Tuan Anh Tran

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：subsequent steps directly, impacts subsequent steps, document processing systems, performance impacts subsequent, scanned document images

备注： This paper has been accepted to ICIP 2022

点击查看摘要

Abstract:Skew estimation is one of the vital tasks in document processing systems, especially for scanned document images, because its performance impacts subsequent steps directly. Over the years, a large body of research has focused on this challenging problem with the rise of the digitization age. In this research, we first propose a novel skew estimation method that extracts the dominant skew angle of the given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum. Second, we introduce a high-quality skew estimation dataset, DISE-2021, to assess the performance of different estimators. Finally, we provide comprehensive analyses that focus on multiple improvement aspects of Fourier-based methods. Our results show that the proposed method is robust, reliable, and outperforms all compared methods. The source code is available at this https URL.

115. 【2603.05940】SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration

链接：https://arxiv.org/abs/2603.05940

作者：Peng Shurui,Xin Lin,Shi Luo,Jincen Ou,Dizhe Zhang,Lu Qi,Truong Nguyen,Chao Ren

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：insufficient expert specialization, Image restoration, diverse degradations remains, degradations remains challenging, challenging for unified

备注：

点击查看摘要

Abstract:Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.

116. 【2603.05937】Facial Expression Recognition Using Residual Masking Network

链接：https://arxiv.org/abs/2603.05937

作者：Luan Pham,TheHuynh Vu,Tuan Anh Tran

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Automatic facial expression, facial expression recognition, Automatic facial, improve FER tasks, human-computer interaction

备注：

点击查看摘要

Abstract:Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improve FER tasks, this paper focuses on deep architectures with attention mechanisms. We propose a novel masking idea to boost the performance of CNNs on facial expression tasks. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. The source code is available at this https URL.

117. 【2603.05936】OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

链接：https://arxiv.org/abs/2603.05936

作者：Kota Shimomura,Masaki Nambata,Atsuya Ishikawa,Ryota Mimura,Takayuki Kawabuchi,Takayoshi Yamashita,Koki Inoue

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：high perception performance, handling rare situations, demonstrate high perception, perception performance, autonomous driving systems

备注： Accepted to ICCV 2025

点击查看摘要

Abstract:Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Such road infrastructures are designed for human drivers, and safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce a baseline approach (the OD-RASE model), which leverages the LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.

118. 【2603.05932】FTSplat: Feed-forward Triangle Splatting Network

链接：https://arxiv.org/abs/2603.05932

作者：Xiong Jinlin,Li Can,Shen Jiawei,Qi Zhigang,Sun Lei,Zhao Dongyang

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：High-fidelity three-dimensional, Neural Radiance Fields, Gaussian Splatting, Radiance Fields, feed-forward Gaussian splatting

备注：

点击查看摘要

Abstract:High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.

119. 【2603.05929】Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

链接：https://arxiv.org/abs/2603.05929

作者：Hongwei Fang,Jiahang Cai,Xun Wang,Wenwu Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：global modeling capability, strong global modeling, Vision Transformer tailored, recently achieved, modeling capability

备注：

点击查看摘要

Abstract:Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: this https URL.

120. 【2603.05926】Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes

链接：https://arxiv.org/abs/2603.05926

作者：Nakul Agarwal,Yi-Ting Chen,Behzad Dariush

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Achieving zero-collision mobility, intelligent vehicle systems, zero-collision mobility remains, perception-a complex cognitive, complex cognitive process

备注： Accepted to IV 2026

点击查看摘要

Abstract:Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception, a complex cognitive process shaped by the voluntary response of the driver to external stimuli and the attentiveness of surrounding road users towards the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver's intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.

121. 【2603.05925】RAC: Rectified Flow Auto Coder

链接：https://arxiv.org/abs/2603.05925

作者：Sen Fang,Yalin Feng,Yanxin Zhang,Dimitris N. Metaxas

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Rectified Flow Auto, Flow Auto Coder, Rectified Flow, Flow Auto, Auto Coder

备注： 11 Figures, 4 Tables. Project Page at [this https URL](https://world-snapshot.github.io/RAC/)

点击查看摘要

Abstract:In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi-step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step-by-step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction-generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.

122. 【2603.05921】BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

链接：https://arxiv.org/abs/2603.05921

作者：Feiran Li,Qianqian Xu,Shilong Bao,Zhiyong Yang,Xilin Zhao,Xiaochun Cao,Qingming Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：models under black-box, paper investigates, investigates the challenging, challenging task, task of detecting

备注： This paper is accepted by CVPR 2026

点击查看摘要

Abstract:This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at this https URL.

123. 【2603.05911】CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

链接：https://arxiv.org/abs/2603.05911

作者：Yuxin Xie,Yuming Chen,Yishan Yang,Yi Zhou,Tao Zhou,Zhen Zhao,Jiacheng Liu,Huazhu Fu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Medical image segmentation, Medical image, Multimodal Large Language, cognitive reasoning analysis, conventional visual pattern

备注： Under Review with Computational Visual Media

点击查看摘要

Abstract:Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: this https URL

124. 【2603.05908】Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

链接：https://arxiv.org/abs/2603.05908

作者：Zidian Qiu,Ancong Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：inflexible joint object-layout, Current compositional, generation approaches construct, joint object-layout generation, time-consuming iterative layout

备注： Accepted to CVPR 2026. Project page: [this https URL](https://qiuzidian.github.io/pano3dcomposer-page/)

点击查看摘要

Abstract:Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: this https URL.

125. 【2603.05906】Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

链接：https://arxiv.org/abs/2603.05906

作者：Ping Chen,Zezhou Chen,Xingpeng Zhang,Yanlin Qian,Huan Hu,Xiang Liu,Zipeng Wang,Xin Wang,Zhaoxiang Liu,Kai Wang,Shiguo Lian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：emotionally resonant experience, achieve geometric accuracy, failing to replicate, replicate the immersive, immersive and emotionally

备注： Accepted by CVPR 2026 (10 pages, 4 figures)

点击查看摘要

Abstract:Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.

126. 【2603.05905】CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection

链接：https://arxiv.org/abs/2603.05905

作者：Xuecheng Bai,Yuxiang Wang,Chuanzhi Xu,Boyu Hu,Kang Han,Ruijie Pan,Xiaowei Niu,Xiaotian Guan,Liqiang Fu,Pengfei Ye

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：unmanned aerial vehicle, limited computational resources, Small object detection, Small object, structural detail degradation

备注：

点击查看摘要

Abstract:Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: this https URL.

127. 【2603.05899】Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification

链接：https://arxiv.org/abs/2603.05899

作者：Schrasing Tong,Antoine Salaun,Vincent Yuan,Annabel Adeyeri,Lalana Kagal

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：Ensuring fairness, perpetuating and amplifying, classification prevents models, image classification prevents, Ensuring

备注：

点击查看摘要

Abstract:Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: 1. Decreasing information leakage using a top-k concept filter, 2. Removing biased concepts, and 3. Adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.
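The first mitigation technique, a top-k concept filter, can be sketched as below. The abstract does not say how the filter ranks concepts, so this example assumes a per-image ranking by raw activation; the function names, concept scores, and class weights are hypothetical.

```python
def top_k_concept_filter(concept_scores, k):
    """Keep the k largest concept activations and zero out the rest.

    Assumption (not from the paper): concepts are ranked per image by raw
    activation, with ties broken by lower index.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(concept_scores)),
                    key=lambda i: (-concept_scores[i], i))
    keep = set(ranked[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(concept_scores)]


def linear_predict(scores, class_weights):
    """Pick the class with the highest logit from a one-layer classifier."""
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return logits.index(max(logits))


# Hypothetical concept activations for one image and a 2-class weight matrix.
scores = [0.9, 0.05, 0.7, 0.1]
class_weights = [
    [1.0, 0.0, 0.0, 5.0],  # class 0 leans heavily on the weakly active 4th concept
    [0.0, 0.0, 1.5, 0.0],  # class 1 relies on the strongly active 3rd concept
]
filtered = top_k_concept_filter(scores, k=2)
print(filtered)                                 # [0.9, 0.0, 0.7, 0.0]
print(linear_predict(scores, class_weights))    # 0
print(linear_predict(filtered, class_weights))  # 1
```

Zeroing the weakly activated concepts removes the channel through which a heavily weighted but barely present concept could sway the prediction. This is one way to read "decreasing information leakage".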

128. 【2603.05898】InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

链接：https://arxiv.org/abs/2603.05898

作者：Yuxin Qin,Ke Cao,Haowei Liu,Ao Ma,Fengheng Li,Honghe Zhu,Zheng Zhang,Run Ling,Wei Feng,Xuanhua He,Zhanjie Zhang,Zhen Guo,Haoyi Bian,Jingjing Lv,Junjie Shen,Ching Law

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：effectively conveys product, conveys product information, poster generation aims, E-commerce product poster, product poster generation

备注： Accepted by CVPR2026

点击查看摘要

Abstract:E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, glyph, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.

129. 【2603.05888】PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

链接：https://arxiv.org/abs/2603.05888

作者：Xiang Zhang,Sohyun Yoo,Hongrui Wu,Chuan Li,Jianwen Xie,Zhuowen Tu

类目：Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词：single RGB image, autoregressively reconstruct complete, single RGB, reconstruct complete, RGB image

备注： CVPR 2026. Project Page: [this https URL](https://mlpc-ucsd.github.io/PixARMesh)

点击查看摘要

Abstract:We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.

130. 【2603.05882】CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis

链接：https://arxiv.org/abs/2603.05882

作者：Qiwei Wang,Xianghui Ze,Jingyi Yu,Yujiao Shi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Gaussian Splatting, imagery remains challenging, shown great promise, panoramic imagery remains, remains challenging

备注：

点击查看摘要

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes are poor at capturing the inherent geometry of 360° scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely-viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.

131. 【2603.05876】Systematic Evaluation of Novel View Synthesis for Video Place Recognition

Link: https://arxiv.org/abs/2603.05876

Authors: Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: positively impact robot, impact robot navigation, potential to positively, positively impact, Video Place Recognition

Comments: Submitted to IEEE IROS 2026

Abstract:The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.

132. 【2603.05873】Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation

Link: https://arxiv.org/abs/2603.05873

Authors: Bowen Chen, Qiaohui Gao, Shaowen Wan, Shanhui Sun, Wei Liu, Xiang Li, Tianming Liu, Lin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinical workflows, generalize across institutions, patient populations, fundamental to clinical, fail to generalize

Comments:

Abstract:Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.

133. 【2603.05869】PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

Link: https://arxiv.org/abs/2603.05869

Authors: Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, challenging multimodal understanding, achieved remarkable, remarkable progress, wide range

Comments:

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.

134. 【2603.05867】TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

Link: https://arxiv.org/abs/2603.05867

Authors: Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin, Yihan Xie, Jianxiang An, Boxiang Yun, Chenglin Yang, Jun Xiao, Guangyu Guo, Jiawen Yao, Wei Liu, Yuan Gao, Ke Yan, Weiwei Cao, Zhilin Zheng, Tony C. W. Mok, Kai Cao, Yu Shi, Jiuyu Zhang, Jian Zhou, Beng Chin Ooi, Yingda Xia, Ling Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assessment guide diagnosis, Accurate tumor analysis, pathology-level risk assessment, risk assessment guide, Accurate tumor

Comments: Accepted at ICLR 2026. 10 pages + appendix

Abstract:Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at this https URL.

135. 【2603.05860】Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery

Link: https://arxiv.org/abs/2603.05860

Authors: Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong, Yefeng Zheng, Yafei Ou

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinicians iteratively combine, iteratively combine visual, combine visual evidence, Clinical image interpretation, quantify findings

Comments: 18 pages, 4 figures, 3 tables

Abstract:Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, across tasks, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual re-design. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.

136. 【2603.05851】VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction

Link: https://arxiv.org/abs/2603.05851

Authors: Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji, Yizhen Lao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mitigate camera shake, Video stabilization aims, aims to mitigate, shake but faces, faces a fundamental

Comments:

Abstract:Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.

137. 【2603.05845】Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation

Link: https://arxiv.org/abs/2603.05845

Authors: Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, Luxin Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: producing semantically plausible, spatial geometry constraints, Generative models, semantically plausible, achieved success

Comments:

Abstract:Generative models have achieved success in producing semantically plausible 2D images, but 3D generation remains challenging due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on the Marble World Labs. Extensive experiments demonstrate that Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.

138. 【2603.05844】Remote Sensing Image Classification Using Deep Ensemble Learning

Link: https://arxiv.org/abs/2603.05844

Authors: Niful Islam, Md. Rayhan Ahmed, Nur Mohammad Fahad, Salekul Islam, A.K.M. Muzahidul Islam, Saddam Mukta, Swakkhar Shatabda

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: requires accurate computerized, computerized classification techniques, accurate computerized classification, Convolutional Neural Networks, sensing imagery plays

Comments:

Abstract:Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Convolutional Neural Networks (CNNs), the most widely used models for image classification, excel at local feature extraction but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
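
The abstract above says the four fusion models are combined "at the final prediction stage through ensembling" but does not state the rule. As a minimal sketch, assuming soft voting (averaging class probabilities, which may not be what the authors use), the combination step could look like this. The function name `soft_vote` and the example probabilities are hypothetical.

```python
def soft_vote(prob_lists):
    """Average per-model class probabilities and return (predicted class, averages).

    prob_lists holds one probability vector per model, all over the same classes.
    Soft voting is an assumption here; the abstract does not specify the rule.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return predicted, avg


# Hypothetical outputs of four fusion models for one image over three classes.
model_probs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
]
label, averaged = soft_vote(model_probs)
```

In this example one model prefers class 1 on its own, yet the averaged distribution still favors class 0, which is the kind of disagreement an ensemble is meant to smooth out.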

139. 【2603.05812】Margin and Consistency Supervision for Calibrated and Robust Vision Models

Link: https://arxiv.org/abs/2603.05812

Authors: Salim Khazem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Deep vision classifiers, small distribution shifts, remaining poorly calibrated, achieve high accuracy, Deep vision

Comments:

Abstract:Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity formalized via a Lipschitz-type stability proxy yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
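
The objective described in the abstract above (cross-entropy plus a hinge-squared margin penalty plus a KL consistency term) can be sketched for a single example as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights `lam_margin` and `lam_cons`, the default `margin`, and the KL direction (clean relative to perturbed) are all hypothetical choices that the abstract does not fix.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def macs_loss(clean_logits, perturbed_logits, label,
              margin=1.0, lam_margin=0.5, lam_cons=0.5):
    """Sketch of a MaCS-style loss for one example (hyperparameters assumed)."""
    logp_clean = log_softmax(clean_logits)
    logp_pert = log_softmax(perturbed_logits)

    # Standard cross-entropy on the clean input.
    ce = -logp_clean[label]

    # Hinge-squared penalty on the logit gap to the strongest competitor.
    competitor = max(z for i, z in enumerate(clean_logits) if i != label)
    gap = clean_logits[label] - competitor
    margin_pen = max(0.0, margin - gap) ** 2

    # KL(clean || perturbed); the direction is an assumption.
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp_clean, logp_pert))

    total = ce + lam_margin * margin_pen + lam_cons * kl
    return {"total": total, "ce": ce, "margin": margin_pen, "kl": kl}


parts = macs_loss([1.0, 0.5, -0.2], [0.8, 0.6, -0.1], label=0)
```

When the logit gap already exceeds the target margin and the perturbed prediction matches the clean one, both extra terms vanish and the loss reduces to plain cross-entropy.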

140. 【2603.05811】Training-free Latent Inter-Frame Pruning with Attention Recovery

Link: https://arxiv.org/abs/2603.05811

Authors: Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational latency, making real-time applications, applications prohibitively costly, real-time applications prohibitively, Current video generation

Comments:

Abstract:Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
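
To make the "detect and skip duplicated latent patches" idea in the abstract above concrete, here is a minimal sketch. It assumes a patch counts as duplicated when its mean absolute difference from the same-position patch in the previous frame falls below a threshold `tol`; that criterion, the function names, and the cache-reuse scheme are hypothetical, and the Attention Recovery step is not modeled.

```python
def prune_mask(prev_patches, curr_patches, tol=1e-3):
    """Flag patches whose latent barely changed since the previous frame."""
    mask = []
    for prev, curr in zip(prev_patches, curr_patches):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        mask.append(mean_abs_diff < tol)
    return mask


def process_frame(prev_patches, prev_outputs, curr_patches, compute, tol=1e-3):
    """Recompute only changed patches; reuse cached outputs for the rest."""
    mask = prune_mask(prev_patches, curr_patches, tol)
    outputs = [
        cached if skip else compute(patch)
        for skip, cached, patch in zip(mask, prev_outputs, curr_patches)
    ]
    return outputs, sum(mask)


def expensive(patch):
    # Stand-in for the costly per-patch model computation.
    return [2 * x for x in patch]


prev = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
prev_out = [expensive(p) for p in prev]
curr = [[0.0, 1.0], [2.5, 3.0], [4.0, 5.0]]  # only the middle patch changed
outputs, skipped = process_frame(prev, prev_out, curr, expensive)
```

For scale, the throughput figures quoted in the abstract are consistent with each other: 12.2 FPS divided by 8.4 FPS is roughly 1.45.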

141. 【2603.05807】EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

Link: https://arxiv.org/abs/2603.05807

Authors: Adam D. Hines, Gokul B. Nair, Nicolás Marticorena, Michael Milford, Tobias Fischer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dynamic vision sensors, Dynamic vision, high-temporal resolution, computer vision tasks, vision tasks due

Comments: 10 pages, 4 figures, 5 tables, under review

Abstract:Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global patch feature embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work achieves state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: this https URL

142. 【2603.05787】Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction

Link: https://arxiv.org/abs/2603.05787

Authors: Ling Xiao, Yuliang Xiu, Yue Chen, Guoming Wang, Toshihiko Yamasaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Model, Vision Foundation, Foundation Model, images as input, multi-view images

Comments:

Abstract:A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.

143. 【2603.05781】Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

Link: https://arxiv.org/abs/2603.05781

Authors: Donghoon Han, Eunhwan Park, Seunghyeon Seo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: offers limited interpretability, Vision Transformer patch, interpretability and attribution, compute-intensive at scale, Dense image retrieval

Comments:

Abstract:Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq 0.993$, enabling a two-stage pipeline that reranks only $K = 200$ candidates per query and recovers near-dense accuracy within $0.2\%$ on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
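
As a rough illustration of the first-stage retrieval described in the abstract above, the sketch below applies the standard Okapi BM25 formula over an inverted index of visual words. It assumes each image has already been reduced to a bag of visual-word counts; how the SAE activations become those counts, the `k1` and `b` values, and all names and toy data here are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def build_index(gallery):
    """gallery maps image_id -> {visual_word: term frequency}."""
    n_docs = len(gallery)
    df = Counter()
    for words in gallery.values():
        df.update(words.keys())
    # Okapi BM25 IDF with the usual +1 inside the log to keep it positive.
    idf = {w: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0) for w, n in df.items()}
    lengths = {img: sum(words.values()) for img, words in gallery.items()}
    avg_len = sum(lengths.values()) / n_docs
    inverted = {}
    for img, words in gallery.items():
        for w, tf in words.items():
            inverted.setdefault(w, []).append((img, tf))
    return idf, lengths, avg_len, inverted


def bm25_search(query_words, index, k1=1.2, b=0.75, top_k=200):
    """Score only images that share a visual word with the query."""
    idf, lengths, avg_len, inverted = index
    scores = {}
    for w in set(query_words):
        for img, tf in inverted.get(w, []):
            norm = tf + k1 * (1 - b + b * lengths[img] / avg_len)
            scores[img] = scores.get(img, 0.0) + idf[w] * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]


# Toy gallery: "sky" fires in every image, "beak" in only one.
gallery = {
    "a": {"sky": 3, "beak": 1},
    "b": {"sky": 2},
    "c": {"sky": 4, "fur": 2},
}
index = build_index(gallery)
candidates = bm25_search(["sky", "beak"], index)
```

Because "sky" appears in every image its IDF is small, so the rare word "beak" dominates the score and image "a" ranks first. The returned candidate list is what a dense reranker would then reorder in the two-stage pipeline.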

144. 【2603.05769】Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

Link: https://arxiv.org/abs/2603.05769

Authors: Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limiting real-world usability, Region-instructed layout control, training-based approaches inherit, current techniques struggle, degrade image quality

Comments: Accepted by CVPR26

Abstract:Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.

145. 【2603.05768】Bridging Domains through Subspace-Aware Model Merging

Link: https://arxiv.org/abs/2603.05768

Authors: Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrates multiple task-specific, multiple task-specific models, merging integrates multiple, integrates multiple, multiple task-specific

Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (Main Track)

Abstract:Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.

146. 【2603.05758】Full Dynamic Range Sky-Modelling For Image Based Lighting

Link: https://arxiv.org/abs/2603.05758

Authors: Ian J. Maquignaz

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: modelling real-world outdoor, key component, component to modelling, modelling real-world, real-world outdoor scenes

Comments:

Abstract:Accurate environment maps are a key component of modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical capture, physical simulation, and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends the current state of the art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).

147. 【2603.05732】From Phase Grounding to Intelligent Surgical Narratives

Link: https://arxiv.org/abs/2603.05732

Authors: Ethan Peterson, Huixin Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: important part, part of tool-assisted, key parts, Video surgery timelines, surgical video frames

Comments:

Abstract:Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.

148. 【2603.05729】Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

Link: https://arxiv.org/abs/2603.05729

Authors: Junyu Chen, Md Yousuf Harun, Christopher Kanan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: images depicting multiple, depicting multiple objects, images depicting, Multi-label annotations, ImageNet benchmark enforces

Comments: Accepted to CVPR 2026 Findings

Abstract:The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at this https URL.

149. 【2603.05711】Any to Full: Prompting Depth Anything for Depth Completion in One Stage

Link: https://arxiv.org/abs/2603.05711

Authors: Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incomplete measurements due, dense depth estimation, robotic perception, hardware limitations, crucial for robotic

Abstract:Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4× speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Codes and checkpoints are available at this https URL.
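
For readers unfamiliar with the AbsREL figure quoted above, the metric is commonly defined as the mean of |prediction − ground truth| / ground truth over pixels with a valid ground-truth depth. The sketch below assumes that common definition (the abstract does not spell out the exact evaluation protocol), and the function name `abs_rel` and the sample values are illustrative only.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


# Illustrative depths in metres; 0.0 marks a pixel with no ground-truth measurement.
pred = [1.1, 2.0, 2.7, 5.0]
gt = [1.0, 2.0, 3.0, 0.0]
print(round(abs_rel(pred, gt), 4))
```

Lower is better, so a relative improvement in AbsREL means the reported error shrank by that fraction.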

150. 【2603.05708】Interpretable Perception and Reasoning for Audiovisual Geolocation

Link: https://arxiv.org/abs/2603.05708

Authors: Yiyang Su, Xiaoming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, formidable challenge due

Abstract:While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results suggest that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
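
Flow matching on the sphere $S^2$ works with geodesics (great-circle arcs) rather than straight lines between points. As a rough illustration of that building block only, the sketch below interpolates along the great circle between two locations. It is not the paper's model: the helper names `latlon_to_unit` and `slerp` are hypothetical, and the learned vector field that flow matching would train is omitted entirely.

```python
import math


def latlon_to_unit(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a unit vector on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def slerp(p, q, t):
    """Point at fraction t along the great-circle arc from unit vector p to unit vector q."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)  # angle between the two points
    if omega < 1e-12:
        return tuple(p)  # the points coincide
    s = math.sin(omega)
    if s < 1e-12:
        raise ValueError("antipodal points: the geodesic is not unique")
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return tuple(w0 * a + w1 * b for a, b in zip(p, q))


start = latlon_to_unit(0.0, 0.0)
end = latlon_to_unit(0.0, 90.0)
midpoint = slerp(start, end, 0.5)
print(midpoint, math.sqrt(sum(c * c for c in midpoint)))
```

Every interpolated point stays on the unit sphere, which is the property that makes this interpolant suitable for locations on the globe.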

151. 【2603.05697】MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

Link: https://arxiv.org/abs/2603.05697

Authors: Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video understanding separately, understanding separately, Multimodal large language, large language models, large language

Abstract:Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
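
Because each question is grounded in a single evidence item, Recall@k reduces to the fraction of questions whose evidence item appears among the top k retrieved candidates. A minimal sketch under that reading follows; the function name `recall_at_k` and the candidate identifiers are made up for illustration and do not come from the benchmark.

```python
def recall_at_k(rankings, gold_ids, k):
    """Fraction of questions whose single gold evidence item is in the top-k results."""
    if not gold_ids or len(rankings) != len(gold_ids):
        raise ValueError("need one non-empty ranking list per question")
    hits = sum(1 for ranking, gold in zip(rankings, gold_ids) if gold in ranking[:k])
    return hits / len(gold_ids)


# Hypothetical retrieval results for three questions over a mixed-modality pool.
rankings = [
    ["img_7", "doc_2", "vid_1"],
    ["doc_9", "img_3", "doc_4"],
    ["vid_5", "vid_8", "img_1"],
]
gold = ["img_7", "doc_4", "img_2"]
print(recall_at_k(rankings, gold, 1), recall_at_k(rankings, gold, 3))
```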

152. 【2603.05686】OWL: A Novel Approach to Machine Perception During Motion

Link: https://arxiv.org/abs/2603.05686

Authors: Daniel Raviv, Juan D. Yepes

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual motion cues, visual motion, perception-related function, designed to address, motion

Abstract:We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.

153. 【2603.05663】Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Link: https://arxiv.org/abs/2603.05663

Authors: Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Temporal Grounding, pipelines prohibitively expensive, Temporal Grounding, Video Temporal, moment in long

Abstract:Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
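
The per-frame budget step described above (balancing query relevance against inter-frame variation while avoiding over-pruned segments) could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the linear blend with weight `alpha`, the per-frame floor `min_tokens`, the non-negative input scores, and the function name `allocate_budgets` are not taken from the paper.

```python
def allocate_budgets(relevance, variation, total_budget, alpha=0.5, min_tokens=1):
    """Split total_budget tokens across frames in proportion to a blended score."""
    n = len(relevance)
    if n == 0 or n != len(variation):
        raise ValueError("relevance and variation must be non-empty and equal length")
    if total_budget < n * min_tokens:
        raise ValueError("budget too small for the per-frame floor")
    # Assumed scoring rule: a convex blend of the two non-negative signals.
    scores = [alpha * r + (1 - alpha) * v for r, v in zip(relevance, variation)]
    total_score = sum(scores)
    spare = total_budget - n * min_tokens  # tokens left after giving every frame its floor
    if total_score <= 0:
        shares = [spare / n] * n
    else:
        shares = [spare * s / total_score for s in scores]
    budgets = [min_tokens + int(share) for share in shares]
    # Hand out tokens lost to rounding, largest fractional share first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical scores for four frames and a budget of 16 visual tokens.
budgets = allocate_budgets([0.9, 0.1, 0.5, 0.1], [0.2, 0.1, 0.8, 0.1], total_budget=16)
print(budgets, sum(budgets))
```

The floor keeps every frame represented, which is one simple way to avoid leaving a temporal segment with no tokens at all.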

154. 【2603.05659】When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Link: https://arxiv.org/abs/2603.05659

Authors: Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: driven strong gains, Reinforcement learning, synthesizing evaluation criteria, clear correctness signals, learning with verifiable

Abstract:Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.

155. 【2603.05630】Making Reconstruction FID Predictive of Diffusion Generation FID

Link: https://arxiv.org/abs/2603.05630

Authors: Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: generation FID, FID, latent diffusion model, gFID, propose interpolated FID

Abstract:It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID, by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results on diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations of approximately 0.85. The source code is provided in this https URL.
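
The interpolation step of iFID (find each latent's nearest neighbour, then blend the two) can be sketched on toy vectors as below. This is a simplified illustration rather than the released code: it uses a brute-force Euclidean search, the interpolation weight `t=0.5` is an assumption the abstract does not state, and the subsequent VAE decoding and FID computation are left out.

```python
import math


def nearest_neighbor_interpolation(latents, t=0.5):
    """For each latent vector, interpolate toward its nearest neighbour (excluding itself)."""
    if len(latents) < 2:
        raise ValueError("need at least two latents")
    interpolated = []
    for i, z in enumerate(latents):
        # Brute-force nearest neighbour by Euclidean distance.
        j = min(
            (k for k in range(len(latents)) if k != i),
            key=lambda k: math.dist(z, latents[k]),
        )
        interpolated.append([(1 - t) * a + t * b for a, b in zip(z, latents[j])])
    return interpolated


# Toy 2-D "latents"; real VAE latents are much higher-dimensional.
latents = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
for z in nearest_neighbor_interpolation(latents):
    print(z)
# The interpolated latents would next be decoded and compared to the dataset with FID.
```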

156. 【2603.05629】Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

Link: https://arxiv.org/abs/2603.05629

Authors: Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: face fundamental limitations, Concept Bottleneck, causing recent CBMs, Concept Bottleneck Models, pre-evaluate concept relevance

Comments: Accepted to CVPR 2026

Abstract:Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.

157. 【2603.05623】Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection

Link: https://arxiv.org/abs/2603.05623

Authors: Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enable accurate, autonomous driving, driving to enable, Post Fusion Stabilizer, Camera-LiDAR fusion

Abstract:Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.

158. 【2603.05622】Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening

Link: https://arxiv.org/abs/2603.05622

Authors: Lei Tong, Xujing Yao, Adam Corrigan, Long Chen, Navin Rathna Kumar, Kerry Hallbrook, Jonathan Orme, Yinhai Wang, Huiyu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: High-Content Screening routinely, Screening routinely generates, routinely generates massive, generates massive volumes, High-Content Screening

Comments: Preprint

Abstract:High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.

159. 【2603.05607】DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces

Link: https://arxiv.org/abs/2603.05607

Authors: Mohammad Sadil Khan, Muhammad Usama, Rolandos Alexandros Potamias, Didier Stricker, Muhammad Zeshan Afzal, Jiankang Deng, Ismail Elezi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: explicit design histories, Computer-Aided Design, explicit design, design histories, boundary representation

Comments: For Caption Dataset: [this https URL](https://huggingface.co/datasets/SadilKhan/CADCap-1M)

Abstract:Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and surpassing 75% user preference. Code and dataset will be publicly available.

160. 【2603.05604】From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

Link: https://arxiv.org/abs/2603.05604

Authors: Xusheng Luo, Changliu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: including pose estimation, small input perturbations, viewpoint recovery, modern neural models, including pose

Comments: 21 pages, 4 figures, 9 tables. arXiv admin note: text overlap with [arXiv:2408.00117](https://arxiv.org/abs/2408.00117)

Abstract:Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.

161. 【2603.05591】Thinking with Spatial Code for Physical-World Video Reasoning

Link: https://arxiv.org/abs/2603.05591

Authors: Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transforms RGB video, visual question answering, physical-world visual question, introduce Thinking, transforms RGB

Comments: Code at [this https URL](https://github.com/Beckschen/spatialcode)

Abstract:We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at this https URL.

162. 【2603.05582】Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

Link: https://arxiv.org/abs/2603.05582

Authors: Ivan Luiz De Moura Matos, Abdel Djalil Sad Saoud, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: perform complex training, complex training procedures, debiasing techniques, dataset manipulation, issue of algorithmic

Comments: This work has been accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as an unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates "bias-free" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.

163. 【2603.05551】AutothinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

Link: https://arxiv.org/abs/2603.05551

Authors: Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Information-intensive Document Question, Document Question Answering, Question Answering, performing precise direct, hinders Vision-Language Models

164. 【2603.05546】Digital-Twin Losses for Lane-Compliant Trajectory Prediction at Urban Intersections

Link: https://arxiv.org/abs/2603.05546

Authors: Kuo-Yi Chao, Erik Leo Haß, Melina Gegg, Jiajie Zhang, Ralph Raßhofer, Alois Christian Knoll

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex multi-agent interactions, Accurate and safety-conscious, safety-conscious trajectory prediction, key technology, environments with complex

Comments: 7 pages, 2 figures, conference

Abstract:Accurate and safety-conscious trajectory prediction is a key technology for intelligent transportation systems, especially in V2X-enabled urban environments with complex multi-agent interactions. In this paper, we created a digital twin-driven V2X trajectory prediction pipeline that jointly leverages cooperative perception from vehicles and infrastructure to forecast multi-agent motion at signalized intersections. The proposed model combines a Bi-LSTM-based generator with a structured training objective consisting of a standard mean squared error (MSE) loss and a novel twin loss. The twin loss encodes infrastructure constraints, collision avoidance, diversity across predicted modes, and rule-based priors derived from the digital twin. While the MSE term ensures point-wise accuracy, the twin loss penalizes traffic rule violations, predicted collisions, and mode collapse, guiding the model toward scene-consistent and safety-compliant predictions. We train and evaluate our approach on real-world V2X data sent from the intersection to the vehicle and collected in urban corridors. In addition to standard trajectory metrics (ADE, FDE), we introduce ITS-relevant safety indicators, including infrastructure and rule violation rates. Experimental results demonstrate that the proposed training scheme significantly reduces critical violations while maintaining comparable prediction accuracy and real-time performance, highlighting the potential of digital twin-driven multi-loss learning for V2X-enabled intelligent transportation systems.
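
The training objective described above combines a point-wise MSE term with a twin loss that penalizes constraint violations. The sketch below illustrates only that additive structure with a single, heavily simplified penalty: a squared hinge on how far predicted waypoints leave a straight lane. The lane geometry, the weight, and all function names are assumptions for illustration; the paper's collision, diversity, and rule-based terms are not modelled here.

```python
def mse(pred, target):
    """Mean squared error over 2-D waypoints."""
    return sum(
        (px - tx) ** 2 + (py - ty) ** 2 for (px, py), (tx, ty) in zip(pred, target)
    ) / len(pred)


def lane_penalty(pred, lane_center_y=0.0, half_width=1.5):
    """Squared hinge on lateral distance beyond the boundary of a straight lane along x."""
    return sum(
        max(0.0, abs(py - lane_center_y) - half_width) ** 2 for _, py in pred
    ) / len(pred)


def total_loss(pred, target, weight=1.0):
    """MSE plus a weighted constraint penalty (a stand-in for the twin loss)."""
    return mse(pred, target) + weight * lane_penalty(pred)


# Hypothetical trajectory whose last waypoint drifts 0.5 m outside the lane.
pred = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(mse(pred, target), lane_penalty(pred), total_loss(pred, target))
```

A trajectory that stays inside the lane incurs no penalty, so the extra term only changes the gradient for predictions that violate the constraint.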

165. 【2603.05537】Edges Are All You Need: Robust Gait Recognition via Label-Free Structure

Link: https://arxiv.org/abs/2603.05537

Authors: Chao Zhang, Zhuang Zheng, Ruixin Li, Zhanyong Mei

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: non-intrusive biometric technique, security applications, non-intrusive biometric, biometric technique, technique for security

Comments: 8 pages, 2 figures

Abstract:Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce Sketch as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SketchGait, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.

166. 【2603.05528】Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

Link: https://arxiv.org/abs/2603.05528

Authors: Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: linearly scaling complexity, Recent multimodal systems, Recent multimodal, rely on separate, linearly scaling

167. 【2603.05522】RoboLayout: Differentiable 3D Scene Generation for Embodied Agents

Link: https://arxiv.org/abs/2603.05522

Authors: Ali Shamsaddinlou

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: vision language models, open-ended language instructions, Recent advances, shown strong potential, language models

Abstract:Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.

168. 【2603.05518】CoEditor++: Instruction-based Visual Editing via Cognitive Reasoning

Link: https://arxiv.org/abs/2603.05518

Authors: Minheng Ni, Yutao Fan, Zhengyuan Yang, Yeli Shen, Yuxiang Wei, Yaowen Zhang, Lijuan Wang, Lei Zhang, Wangmeng Zuo

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language descriptions, Recent advances, modify visual content, large multimodal models, allowing users

Abstract:Recent advances in large multimodal models (LMMs) have enabled instruction-based image editing, allowing users to modify visual content via natural language descriptions. However, existing approaches often struggle with high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions. To address these challenges, we propose CoEditor++, a cognitively structured, training-free framework that decomposes editing into "what to edit" and "how to edit" through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing. Built entirely from open-sourced components, CoEditor++ requires no additional training or fine-tuning, ensuring transparency and cross-domain applicability. We evaluate CoEditor++ on SmartEdit, a widely used benchmark for general editing, and AltBear, a privacy and compliance-oriented benchmark. Experimental results show that CoEditor++ achieves state-of-the-art performance in both general editing and responsible editing tasks compared with open-sourced models that require training on specialized editing datasets, while maintaining significantly higher visual consistency. When compared with closed-source models such as Nano Banana Pro or GPT-4o, CoEditor++ preserves comparable instruction following while still substantially outperforming them in visual consistency. Extensive ablation studies confirm that the effectiveness of CoEditor++ stems from its structured cognitive design rather than from any specific model component. Our findings point to the potential of cognitive-centric instruction-based image editing.

169. 【2603.06095】Enhancing Neural Video Compression of Static Scenes with Positive-Incentive Noise

Link: https://arxiv.org/abs/2603.06095

Authors: Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Static scene videos, videotelephony streams, constitute a dominant, feeds and videotelephony, dominant share

Abstract:Static scene videos, such as surveillance feeds and videotelephony streams, constitute a dominant share of storage consumption and network traffic. However, both traditional standardized codecs and neural video compression (NVC) methods struggle to encode these videos efficiently due to inadequate usage of temporal redundancy and severe distribution gaps between training and test data, respectively. While recent generative compression methods improve perceptual quality, they introduce hallucinated details that are unacceptable in authenticity-critical applications. To overcome these limitations, we propose to incorporate positive-incentive noise into NVC for static scene videos, where short-term temporal changes are reinterpreted as positive-incentive noise to facilitate model finetuning. By disentangling transient variations from the persistent background, structured prior information is internalized in the compression model. During inference, the invariant component requires minimal signaling, thus reducing data transmission while maintaining pixel-level fidelity. Preliminary experiments demonstrate a 73% Bjøntegaard delta (BD) rate saving compared to general NVC models. Our method provides an effective solution to trade computation for bandwidth, enabling robust video transmission under adverse network conditions and economic long-term retention of surveillance footage.

170. 【2603.05834】Architectural Unification for Polarimetric Imaging Across Multiple Degradations

Link: https://arxiv.org/abs/2603.05834

Authors: Chu Zhou, Yufei Han, Junda Liao, Linrui Dai, Wangze Xu, Art Subpa-Asa, Heng Guo, Boxin Shi, Imari Sato

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including Total Intensity, Degree of Polarization, Angle of Polarization, Total Intensity, including Total

Abstract:Polarimetric imaging aims to recover polarimetric parameters, including Total Intensity (TI), Degree of Polarization (DoP), and Angle of Polarization (AoP), from captured polarized measurements. In real-world scenarios, these measurements are frequently affected by diverse degradations such as low-light noise, motion blur, and mosaicing artifacts. Due to the nonlinear dependency of DoP and AoP on the measured intensities, accurately retrieving physically consistent polarimetric parameters from degraded observations remains highly challenging. Existing approaches typically adopt task-specific network architectures tailored to individual degradation types, limiting their adaptability across different restoration scenarios. Moreover, many methods rely on multi-stage processing pipelines that suffer from error accumulation, or operate solely in a single domain (either image or Stokes domain), failing to fully exploit the intrinsic physical relationships between them. In this work, we propose a unified architectural framework for polarimetric imaging that is structurally shared across multiple degradation scenarios. Rather than redesigning network structures for each task, our framework maintains a consistent architectural design while being trained separately for different degradations. The model performs single-stage joint image-Stokes processing, avoiding error accumulation and explicitly preserving physical consistency. Extensive experiments show that this unified architectural design, when trained for specific degradation types, consistently achieves state-of-the-art performance across low-light denoising, motion deblurring, and demosaicing tasks, establishing a versatile and physically grounded solution for degraded polarimetric imaging.
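
The nonlinear dependence of DoP and AoP on the measured intensities can be made concrete with the standard linear Stokes relations. The sketch below is not the paper's method: it assumes an ideal polarizer sampled at 0°, 45°, 90°, and 135° (a common convention that the abstract does not specify), and the function names `polarimetric_parameters` and `simulate_measurements` are hypothetical.

```python
import math


def polarimetric_parameters(i0, i45, i90, i135):
    """Return (TI, DoP, AoP) for one pixel from four polarizer-angle intensities.

    Assumes ideal polarizer measurements at 0, 45, 90 and 135 degrees.
    """
    # Linear Stokes components.
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity (TI)
    s1 = i0 - i90
    s2 = i45 - i135
    # DoP is a ratio and AoP an arctangent, so both are nonlinear in the inputs.
    dop = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aop = 0.5 * math.atan2(s2, s1)  # radians
    return s0, dop, aop


def simulate_measurements(ti, dop, aop):
    """Hypothetical helper: noise-free intensities behind an ideal polarizer."""
    s1 = ti * dop * math.cos(2.0 * aop)
    s2 = ti * dop * math.sin(2.0 * aop)
    # I(theta) = 0.5 * (S0 + S1*cos(2*theta) + S2*sin(2*theta))
    return tuple(
        0.5 * (ti + s1 * math.cos(2.0 * t) + s2 * math.sin(2.0 * t))
        for t in (0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4)
    )


measurements = simulate_measurements(ti=2.0, dop=0.5, aop=math.radians(30.0))
ti, dop, aop = polarimetric_parameters(*measurements)
```

On noise-free input the round trip recovers the original parameters. Because DoP divides by the total intensity, noise in low-light measurements is amplified in DoP and AoP, which is one way to read the difficulty the abstract describes.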

171. 【2603.05756】Uni-LVC: A Unified Method for Intra- and Inter-Mode Learned Video Compression

Link: https://arxiv.org/abs/2603.05756

Authors: Yichi Zhang, Ruoyu Yang, Fengqing Zhu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: learned video compression, Recent advances, DCVC-RT surpassing, significant performance gains, VVC low-delay mode

Abstract:Recent advances in learned video compression (LVC) have led to significant performance gains, with codecs such as DCVC-RT surpassing the H.266/VVC low-delay mode in compression efficiency. However, existing LVCs still exhibit key limitations: they often require separate models for intra and inter coding modes, and their performance degrades when temporal references are unreliable. To address this, we introduce Uni-LVC, a unified LVC method that supports both intra and inter coding with low-delay and random-access in a single model. Building on a strong intra-codec, Uni-LVC formulates inter-coding as intra-coding conditioned on temporal information extracted from reference frames. We design an efficient cross-attention adaptation module that integrates temporal cues, enabling seamless support for both unidirectional (low-delay) and bidirectional (random-access) prediction modes. A reliability-aware classifier is proposed to selectively scale the temporal cues, making Uni-LVC behave closer to intra coding when references are unreliable. We further propose a multistage training strategy to facilitate adaptive learning across various coding modes. Extensive experiments demonstrate that Uni-LVC achieves superior rate-distortion performance in intra and inter configurations while maintaining comparable computational efficiency.

172. 【2603.05726】Interpretable Motion Artifact Detection in structural Brain MRI

Link: https://arxiv.org/abs/2603.05726

Authors: Naveetha Nithianandam, Prabhjot Kaur, Anil Kumar Sao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, structural brain MRI, reliable neuroimaging analysis, Automated quality assessment, poor generalization

Abstract:Automated quality assessment of structural brain MRI is an important prerequisite for reliable neuroimaging analysis, yet it remains challenging due to motion artifacts and poor generalization across acquisition sites. Existing approaches based on image quality metrics (IQMs) or deep learning either require extensive preprocessing, which incurs high computational cost, or generalize poorly to unseen data. In this work, we propose a lightweight and interpretable framework for detecting motion-related artifacts in T1-weighted brain MRI by extending the Discriminative Histogram of Gradient Magnitude (DHoGM) to three-dimensional space. The proposed method integrates complementary slice-level (2D) and volume-level (3D) DHoGM features through a parallel decision strategy, capturing both localized and global motion-induced degradation. Volumetric analysis is performed using overlapping 3D cuboids to achieve comprehensive spatial coverage while maintaining computational efficiency. A simple threshold-based classifier and a low-parameter multilayer perceptron are used, which results in a model with only 209 trainable parameters. Our method was evaluated on the MR-ART and ABIDE datasets under both seen-site and unseen-site conditions. Experimental results demonstrate strong performance, achieving up to 94.34% accuracy in in-domain evaluation and 89% accuracy on unseen sites, while almost completely avoiding false acceptance of poor-quality scans. Ablation studies confirm the complementary benefits of combining 2D and 3D features. Overall, the proposed approach offers an effective, efficient, and robust solution for automated MRI quality checks, with strong potential for integration into large-scale clinical and research workflows.

173. 【2603.05693】Longitudinal Lesion Inpainting in Brain MRI via 3D Region Aware Diffusion

Link: https://arxiv.org/abs/2603.05693

Authors: Zahra Karimaghaloo, Dumitru Fetco, Haz-Edine Assemlal, Hassan Rivaz, Douglas L. Arnold

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated neuroimaging pipelines, Accurate longitudinal analysis, bias automated neuroimaging, neuroimaging pipelines, Accurate longitudinal

Abstract:Accurate longitudinal analysis of brain MRI is often hindered by evolving lesions, which bias automated neuroimaging pipelines. While deep generative models have shown promise in inpainting these lesions, most existing methods operate cross-sectionally or lack 3D anatomical continuity. We present a novel pseudo-3D longitudinal inpainting framework based on Denoising Diffusion Probabilistic Models (DDPM). Our approach utilizes multi-channel conditioning to incorporate longitudinal context from distinct visits (t_1, t_2) and extends Region-Aware Diffusion (RAD) to the medical domain, focusing the generative process on pathological regions without altering surrounding healthy tissue. We evaluated our model against state-of-the-art baselines on longitudinal brain MRI from 93 patients. Our model significantly outperforms the leading baseline (FastSurfer-LIT) in terms of perceptual fidelity, reducing the Learned Perceptual Image Patch Similarity (LPIPS) distance from 0.07 to 0.03 while effectively eliminating inter-slice discontinuities. Furthermore, our model demonstrates high longitudinal stability with a Temporal Fidelity Index of 1.024, closely approaching the ideal value of 1.0 and substantially narrowing the gap compared to LIT's TFI of 1.22. Notably, the RAD mechanism provides a substantial gain in efficiency; our framework achieves an average processing time of 2.53 min per volume, representing approximately 10x speedup over the 24.30 min required by LIT. By leveraging longitudinal priors and region-specific denoising, our framework provides a highly reliable and efficient preprocessing step for the study of progressive neurodegenerative diseases. A derivative dataset consisting of 93 pre-processed scans used for testing will be available upon request after acceptance. Code will be released upon acceptance.

174. 【2603.05681】Gabor Primitives for Accelerated Cardiac Cine MRI Reconstruction

Link: https://arxiv.org/abs/2603.05681

Authors: Wenqi Huang, Veronika Spieker, Nil Stolt-Ansó, Natascha Niessen, Maik Dannecker, Sevgi Gokce Kafali, Sila Kurugol, Julia A. Schnabel, Daniel Rueckert

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: MRI requires reconstructing, Accelerated cardiac cine, highly undersampled k-space, requires reconstructing spatiotemporal, reconstructing spatiotemporal images

Abstract:Accelerated cardiac cine MRI requires reconstructing spatiotemporal images from highly undersampled k-space data. Implicit neural representations (INRs) enable scan-specific reconstruction without large training datasets, but encode content implicitly in network weights without physically interpretable parameters. Gaussian primitives provide an explicit and geometrically interpretable alternative, but their spectra are confined near the k-space origin, limiting high-frequency representation. We propose Gabor primitives for MRI reconstruction, modulating each Gaussian envelope with a complex exponential to place its spectral support at an arbitrary k-space location, enabling efficient representation of both smooth structures and sharp boundaries. To exploit spatiotemporal redundancy in cardiac cine, we decompose per-primitive temporal variation into a low-rank geometry basis capturing cardiac motion and a signal-intensity basis modeling contrast changes. Experiments on cardiac cine data with Cartesian and radial trajectories show that Gabor primitives consistently outperform compressed sensing, Gaussian primitives, and hash-grid INR baselines, while providing a compact, continuous-resolution representation with physically meaningful parameters.
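
The central idea in this abstract, that modulating a Gaussian envelope with a complex exponential moves its spectral support away from the k-space origin, follows from the Fourier modulation theorem. The 1D toy below illustrates only that property, using a naive DFT from the standard library. It is not the paper's reconstruction method, and the names `gabor_primitive`, `dft_magnitudes`, and `peak_bin`, as well as the chosen sizes, are assumptions made for illustration.

```python
import cmath
import math


def gabor_primitive(n_samples, center, sigma, k0):
    """1D Gaussian envelope modulated by a complex exponential at frequency bin k0."""
    samples = []
    for n in range(n_samples):
        envelope = math.exp(-0.5 * ((n - center) / sigma) ** 2)
        carrier = cmath.exp(2j * math.pi * k0 * n / n_samples)
        samples.append(envelope * carrier)
    return samples


def dft_magnitudes(signal):
    """Naive O(N^2) DFT magnitudes; adequate for a tiny demonstration."""
    n_samples = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n, x in enumerate(signal)))
        for k in range(n_samples)
    ]


def peak_bin(signal):
    """Index of the frequency bin with the largest magnitude."""
    mags = dft_magnitudes(signal)
    return max(range(len(mags)), key=mags.__getitem__)


N = 64
gaussian = gabor_primitive(N, center=32, sigma=6.0, k0=0)   # k0 = 0: plain Gaussian
gabor = gabor_primitive(N, center=32, sigma=6.0, k0=20)     # same envelope, shifted spectrum

gaussian_peak = peak_bin(gaussian)  # spectrum concentrated at the origin
gabor_peak = peak_bin(gabor)        # spectrum re-centred at bin k0
```

With `k0 = 0` the primitive reduces to a plain Gaussian whose spectrum peaks at the origin. With `k0 = 20` the same envelope has its spectrum centred on bin 20, which is what lets a single primitive represent high-frequency content such as sharp boundaries.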

175. 【2603.05535】Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction

Link: https://arxiv.org/abs/2603.05535

Authors: Yuewen Huang, Zhitao Ye, Guangnan Feng, Fudan Zheng, Xia Gao, Yutong Lu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: systemic lupus erythematosus, significantly greater severity, worse renal outcomes, renal outcomes compared, Lupus nephritis

Abstract:Lupus nephritis (LN) is a severe complication of systemic lupus erythematosus that affects pediatric patients with significantly greater severity and worse renal outcomes compared to adults. Despite the urgent clinical need, predicting pediatric LN prognosis remains unexplored in computational pathology. Furthermore, the only existing histopathology-based approach for LN relies on multiple costly staining protocols and fails to integrate complementary clinical data. To address these gaps, we propose the first multimodal computational pathology framework for three-class treatment response prediction (complete remission, partial response, and no response) in pediatric LN, utilizing only routine PAS-stained biopsies and structured clinical data. Our framework introduces two key methodological innovations. First, a Clinical-Injection Transformer (CIT) embeds clinical features as condition tokens into patch-level self-attention, facilitating implicit and bidirectional cross-modal interactions within a unified attention space. Second, we design a decoupled representation-knowledge adaptation strategy using a domain-adapted Masked Autoencoder (MAE). This strategy explicitly separates self-supervised morphological feature learning from pathological knowledge extraction. Additionally, we introduce a multi-granularity morphological type injection mechanism to bridge distilled classification knowledge with downstream prognostic predictions at both the instance and patient levels. Evaluated on a cohort of 71 pediatric LN patients with KDIGO-standardized labels, our method achieves a three-class accuracy of 90.1% and an AUC of 89.4%, demonstrating its potential as a highly accurate and cost-effective prognostic tool.