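This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
1280 papers were updated today, including:
- Natural language processing: 136
- Information retrieval: 35
- Computer vision: 322
Natural Language Processing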
1. 【2603.08706】Agentic Critical Training
Link: https://arxiv.org/abs/2603.08706
Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Training large language, large language models, large language, lack awareness, ACT
Comments: Project page: https://attention-is-all-i-need.github.io/ACT/
Abstract:Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
2. 【2603.08660】How Far Can Unsupervised RLVR Scale LLM Training?
Link: https://arxiv.org/abs/2603.08660
Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Unsupervised reinforcement learning, scale LLM training, Unsupervised reinforcement, scale LLM, ground truth labels
Comments: Accepted to ICLR 2026
Abstract: Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
3. 【2603.08659】CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Link: https://arxiv.org/abs/2603.08659
Authors: Siye Wu, Jian Xie, Yikai Zhang, Yanghua Xiao
Categories: Computation and Language (cs.CL)
Keywords: scaling inference-time compute, inference-time compute significantly, compute significantly enhances, significantly enhances performance, emergence of large
Comments:
Abstract:The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
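The stopping rule described in the abstract above (keep allocating tokens until the marginal accuracy gain falls below the incremental cost) can be illustrated with a minimal Python sketch. This is not the CODA method itself, which learns the behavior through reward shaping; the function `allocate_tokens`, the step size, and the two accuracy curves `easy` and `hard` are hypothetical and chosen only for illustration.

```python
def allocate_tokens(accuracy_at, cost_per_step, step=64, max_tokens=4096):
    """Grow the token budget while the next step's accuracy gain covers its cost."""
    budget = 0
    while budget + step <= max_tokens:
        marginal_gain = accuracy_at(budget + step) - accuracy_at(budget)
        if marginal_gain < cost_per_step:
            break  # further tokens cost more than they are worth
        budget += step
    return budget


# Hypothetical saturating accuracy curves (assumptions, not from the paper):
# an easy instance saturates quickly, a hard instance slowly.
def easy(tokens):
    return 0.95 * (1 - 0.5 ** (tokens / 64))


def hard(tokens):
    return 0.80 * (1 - 0.5 ** (tokens / 1024))


easy_budget = allocate_tokens(easy, cost_per_step=0.01)
hard_budget = allocate_tokens(hard, cost_per_step=0.01)
print(easy_budget, hard_budget)
```

Under these assumed curves the easy instance stops early while the hard instance receives a much larger budget, which is the qualitative behavior the abstract attributes to adaptive reasoning.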
4. 【2603.08655】OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
Link: https://arxiv.org/abs/2603.08655
Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: introduce OfficeQA Pro, Treasury Bulletins spanning, OfficeQA Pro, benchmark for evaluating, large and heterogeneous
Comments: 24 pages, 16 figures. Introduces the OfficeQA Pro benchmark for grounded reasoning over enterprise documents
Abstract:We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
5. 【2603.08578】Drift-to-Action Controllers: Budgeted Interventions with Online Risk Certificates
Link: https://arxiv.org/abs/2603.08578
Authors: Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Deployed machine learning, machine learning systems, learning systems face, systems face distribution, Deployed machine
Comments: Published as a conference paper at CAO Workshop at ICLR 2026
Abstract: Deployed machine learning systems face distribution drift, yet most monitoring pipelines stop at alarms and leave the response underspecified under labeling, compute, and latency constraints. We introduce Drift2Act, a drift-to-action controller that treats monitoring as constrained decision-making with explicit safety. Drift2Act combines a sensing layer that maps unlabeled monitoring signals to a belief over drift types with an active risk certificate that queries a small set of delayed labels from a recent window to produce an anytime-valid upper bound $U_t(\delta)$ on current risk. The certificate gates operation: if $U_t(\delta) \le \tau$, the controller selects low-cost actions (e.g., recalibration or test-time adaptation); if $U_t(\delta) > \tau$, it activates abstain/handoff and escalates to rollback or retraining under cooldowns. In a realistic streaming protocol with label delay and explicit intervention costs, Drift2Act achieves near-zero safety violations and fast recovery at moderate cost on WILDS Camelyon17, DomainNet, and a controlled synthetic drift stream, outperforming alarm-only monitoring, adapt-always adaptation, schedule-based retraining, selective prediction alone, and an ablation without certification. Overall, online risk certification enables reliable drift response and reframes monitoring as decision-making with safety.
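The certificate gate described in the abstract above can be sketched as a small decision function. This is a hedged illustration only: the function name `choose_action`, the action labels, and the reading of "under cooldowns" as "do not escalate again while a cooldown is active" are assumptions, and the computation of the upper bound $U_t(\delta)$ is taken as a given input rather than implemented.

```python
def choose_action(risk_upper_bound, tau, in_cooldown=False):
    """Gate the controller on a risk upper bound U_t(delta) versus threshold tau.

    Action names are hypothetical labels, not identifiers from the paper.
    """
    if risk_upper_bound <= tau:
        # Certified safe: a cheap response such as recalibration
        # or test-time adaptation is enough.
        return "low_cost_adaptation"
    if in_cooldown:
        # Assumption: during a cooldown the controller only abstains or
        # hands off and does not trigger another rollback or retraining.
        return "abstain_or_handoff"
    return "rollback_or_retrain"


print(choose_action(0.03, tau=0.05))
print(choose_action(0.09, tau=0.05, in_cooldown=True))
print(choose_action(0.09, tau=0.05))
```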
6. 【2603.08501】Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA
Link: https://arxiv.org/abs/2603.08501
Authors: Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Qur'an and Hadith, users expect grounding, knowledge queries fluently, religious knowledge queries
Comments:
Abstract: Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) reduces some of these limitations by grounding generation in external evidence. However, a single "retrieve-then-generate" pipeline is limited to deal with the diversity of Islamic this http URL may request verbatim scripture, fatwa-style guidance with citations or rule-constrained computations such as zakat and inheritance that require strict arithmetic and legal invariants. In this work, we present a bilingual (Arabic/English) multi-agent Islamic assistant, called Fanar-Sadiq, which is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic-related queries to specialized modules within an agentic, tool-using architecture. The system supports intent-aware routing, retrieval-grounded fiqh answers with deterministic citation normalization and verification traces, exact verse lookup with quotation validation, and deterministic calculators for Sunni zakat and inheritance with madhhab-sensitive branching. We evaluate the complete end-to-end system on public Islamic QA benchmarks and demonstrate effectiveness and efficiency. Our system is currently publicly and freely accessible through API and a Web application, and has been accessed approximately 1.9M times in less than a year.
7. 【2603.08453】LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing
Link: https://arxiv.org/abs/2603.08453
Authors: Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, processing long contexts, Large Language, substantial memory footprint, present severe computational
Comments: 17 pages, 12 figures
Abstract:The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.
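The abstract above says the index is "rooted in the triangle inequality". As a rough illustration of that pruning principle only, the sketch below searches a single level of clusters under Euclidean distance and skips any cluster whose lower bound cannot beat the current best. It is not LycheeCluster's hierarchical KV index; the function `nearest_with_pruning` and the `(centroid, radius, members)` layout are hypothetical.

```python
import math


def nearest_with_pruning(query, clusters):
    """Find the nearest member across clusters, skipping clusters that cannot win.

    clusters: list of (centroid, radius, members), where radius is at least the
    largest centroid-to-member distance. Returns (best_member, clusters_scanned).
    """
    best, best_dist, scanned = None, math.inf, 0
    # Visit clusters by centroid distance so a good candidate is found early.
    ordered = sorted(clusters, key=lambda c: math.dist(query, c[0]))
    for centroid, radius, members in ordered:
        # Triangle inequality: d(q, x) >= d(q, c) - r for every member x.
        lower_bound = math.dist(query, centroid) - radius
        if lower_bound >= best_dist:
            continue  # no member of this cluster can beat the current best
        scanned += 1
        for x in members:
            d = math.dist(query, x)
            if d < best_dist:
                best, best_dist = x, d
    return best, scanned


clusters = [
    ((0.0, 0.0), 1.5, [(0.0, 1.0), (1.0, 0.0)]),
    ((10.0, 10.0), 1.5, [(10.0, 11.0), (11.0, 10.0)]),
]
print(nearest_with_pruning((0.5, 0.5), clusters))
```

In this toy example the far cluster is never scanned; applying the same bound recursively over a tree of clusters is what turns a linear scan into a pruned search.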
8. 【2603.08450】A Dataset for Probing Translationese Preferences in English-to-Swedish Translation
Link: https://arxiv.org/abs/2603.08450
Authors: Jenny Kunz, Anja Jarochenko, Marcel Bollmann
Categories: Computation and Language (cs.CL)
Keywords: carry traces, translationese, English source sentence, Translations, dataset contrasting translationese
Comments: To appear at LREC 2026
Abstract:Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.
9. 【2603.08448】A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Link: https://arxiv.org/abs/2603.08448
Authors: Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Joëlle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Mike Schaekermann, Alan Karthikesalingam, Adam Rodman
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language model, Large language, Medical Intelligence Explorer, Articulate Medical Intelligence, language model
Comments:
Abstract: Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. 100 adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess the conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared to primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) and appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.
10. 【2603.08436】Can Vision-Language Models Solve the Shell Game?
Link: https://arxiv.org/abs/2603.08436
Authors: Tiedong Liu, Wee Sun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Vision-Language Models, innate cognitive ability, innate cognitive, remains a critical, critical bottleneck
Comments:
Abstract:Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at this https URL .
11. 【2603.08429】One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States
Link: https://arxiv.org/abs/2603.08429
Authors: Bo Jiang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: retrieve external knowledge, external knowledge typically, knowledge typically generate, separate embedding model, query as text
Comments:
Abstract: LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
12. 【2603.08412】Aligning to Illusions: Choice Blindness in Human and AI Feedback
Link: https://arxiv.org/abs/2603.08412
Authors: Wenbin Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Reinforcement Learning, stable internal states, reflect stable internal, assumes annotator preferences, annotator preferences reflect
Comments: 16 pages, 6 figures, 2 tables
Abstract:Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
13. 【2603.08406】Sandpiper: Orchestrated AI-Annotation for Educational Discourse at Scale
Link: https://arxiv.org/abs/2603.08406
Authors: Daryl Hedley, Doug Pietrzak, Jorge Dias, Ian Burden, Bakhtawar Ahtisham, Zhuqian Zhou, Kirk Vanacore, Josh Marland, Rachel Slama, Justin Reich, Kenneth Koedinger, René Kizilcec
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Keywords: Digital educational environments, offers deep insights, Digital educational, instructional processes, educational environments
Comments:
Abstract:Digital educational environments are expanding toward complex AI and human discourse, providing researchers with an abundance of data that offers deep insights into learning and instructional processes. However, traditional qualitative analysis remains a labor-intensive bottleneck, severely limiting the scale at which this research can be conducted. We present Sandpiper, a mixed-initiative system designed to serve as a bridge between high-volume conversational data and human qualitative expertise. By tightly coupling interactive researcher dashboards with agentic Large Language Model (LLM) engines, the platform enables scalable analysis without sacrificing methodological rigor. Sandpiper addresses critical barriers to AI adoption in education by implementing context-aware, automated de-identification workflows supported by secure, university-housed infrastructure to ensure data privacy. Furthermore, the system employs schema-constrained orchestration to eliminate LLM hallucinations and enforces strict adherence to qualitative codebooks. An integrated evaluations engine allows for the continuous benchmarking of AI performance against human labels, fostering an iterative approach to model refinement and validation. We propose a user study to evaluate the system's efficacy in improving research efficiency, inter-rater reliability, and researcher trust in AI-assisted qualitative workflows.
14. 【2603.08398】Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective
Link: https://arxiv.org/abs/2603.08398
Authors: Liyuan Mao, Le Yu, Jing Zhou, Chujie Zheng, Bowen Yu, Chang Gao, Shixuan Liu, An Yang, Weinan Zhang, JunYang Lin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, possess intrinsic behavioral, reinforcement learning, intrinsic behavioral plasticity-akin, Language Models
Comments: Work done during an internship at the Qwen Team, Alibaba Group
Abstract: In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
15. 【2603.08392】COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling
Link: https://arxiv.org/abs/2603.08392
Authors: Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco Spruit
Categories: Computation and Language (cs.CL)
Keywords: provide valuable lifestyle, valuable lifestyle counselling, activities can provide, provide valuable, populations affected
Comments: Under review for the CL4Health workshop
Abstract:Systems that collect data on sleep, mood, and activities can provide valuable lifestyle counselling to populations affected by chronic disease and its consequences. Such systems are, however, challenging to develop; besides reliably extracting patterns from user-specific data, systems should also contextualise these patterns with validated medical knowledge to ensure the quality of counselling, and generate counselling that is relevant to a real user. We present QUORUM, a new evaluation framework that unifies these developer-, expert-, and user-centric perspectives, and show with a real case study that it meaningfully tracks convergence and divergence in stakeholder perspectives. We also present COACH, a Large Language Model-driven pipeline to generate personalised lifestyle counselling for our Healthy Chronos use case, a diary app for cancer patients and survivors. Applying our framework shows that overall, users, medical experts, and developers converge on the opinion that the generated counselling is relevant, of good quality, and reliable. However, stakeholders also diverge on the tone of the counselling, sensitivity to errors in pattern-extraction, and potential hallucinations. These findings highlight the importance of multi-stakeholder evaluation for consumer health language technologies and illustrate how a unified evaluation framework can support trustworthy, patient-centered NLP systems in real-world settings.
16. 【2603.08391】Adaptive Loops and Memory in Transformers: Think Harder or Know More?
Link: https://arxiv.org/abs/2603.08391
Authors: Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Mehdi Ali
Categories: Computation and Language (cs.CL)
Keywords: requires explicit verbalization, prompting enables reasoning, prompting enables, intermediate steps, requires explicit
Comments: Published at Latent Implicit Thinking Workshop @ ICLR 2026
Abstract:Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models which use unique weights per layer. In this work, we investigate transformer models that feature both adaptive per-layer looping, where each transformer block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks, that provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter and FLOP matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline, with three times the number of layers, across math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.
17. 【2603.08359】Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors
Link: https://arxiv.org/abs/2603.08359
Authors: Okko Räsänen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Keywords: typically developing infants, information-processing perspective, enormous challenge, effortless for typically, typically developing
Comments:
Abstract: Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles, principles that are broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.
18. 【2603.08358】Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem
Link: https://arxiv.org/abs/2603.08358
Authors: Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh
Categories: Computation and Language (cs.CL)
Keywords: conditional sentences diverge, Natural Language Inference, language models handle, unresolved issue, sentences diverge
Comments:
Abstract:We investigate how language models handle the proviso problem, an unresolved issue in pragmatics where presuppositions in conditional sentences diverge between theoretical and human interpretations. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.
19. 【2603.08343】Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Link: https://arxiv.org/abs/2603.08343
Authors: Shubham Aggarwal, Lokendra Kumar
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Walsh Hadamard Transform, parameter-free Walsh Hadamard, contributing significantly, inference cost, multi-head attention scales
Comments: 12 pages, 9 figures, 4 tables
Abstract:The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh Hadamard Transform followed by a lightweight learnable affine rescaling, eliminating approximately 25 percent of attention parameters per block while preserving global cross head interaction through an orthogonal, norm-preserving transformation. Across different model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that structured Hadamard-based models exhibit a steeper validation loss curve relative to training FLOPs compared to their dense counterparts, suggesting more favorable compute utilization during training.
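As an illustration of the substitution this abstract describes, the following is a minimal sketch of a fixed, orthonormal Walsh-Hadamard transform followed by a per-channel affine rescaling. It is not the authors' implementation: the function names (`fwht`, `hadamard_projection`, `saved_fraction`) and the parameters `gamma` and `beta` are hypothetical, and the parameter arithmetic assumes a standard attention block with four d x d projections (query, key, value, output), under which dropping the output projection removes roughly a quarter of the attention parameters.

```python
import math


def fwht(values):
    """Orthonormal Walsh-Hadamard transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]


def hadamard_projection(concat_heads, gamma, beta):
    """Fixed transform plus a learnable per-channel scale and shift (hypothetical names)."""
    mixed = fwht(concat_heads)
    return [g * m + b for g, m, b in zip(gamma, mixed, beta)]


def saved_fraction(d):
    """Share of attention parameters removed, assuming four d x d projections per block."""
    dense_output = d * d   # parameters in the dense output projection
    affine = 2 * d         # scale and shift vectors that replace it
    return (dense_output - affine) / (4 * d * d)
```

Because every output coordinate of the transform depends on every input coordinate, the concatenated head outputs are still mixed globally even though no weight matrix is learned; only `gamma` and `beta` would be trained in this sketch.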
20. 【2603.08329】SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2603.08329
Authors: Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda Akpinar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: requires synthesizing facts, synthesizing facts scattered, real-world queries, queries often requires, requires synthesizing
Comments: 12 pages
Abstract:Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).
21. 【2603.08316】SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Link: https://arxiv.org/abs/2603.08316
Authors: Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: based graphical user, graphical user interface, execute actions accurately, user interface, graphical user
Comments: 25 pages
Abstract:Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in this https URL.
22. 【2603.08312】Learning Multiple Utterance-Level Attribute Representations with a Unified Speech Encoder
Link: https://arxiv.org/abs/2603.08312
Authors: Maryem Bouziane, Salima Mdhaffar, Yannick Estève
Categories: Computation and Language (cs.CL)
Keywords: self-supervised learning produce, learning produce generic, produce generic speech, speech processing tasks, foundation models trained
Comments: Submitted to Interspeech
Abstract:Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learning, these models can achieve strong performance on specific downstream tasks. Recent post-training approaches, such as SAMU-XSLR and SONAR, align speech representations with utterance-level semantic representations, enabling effective multimodal (speech-text) and multilingual applications. While speech foundation models typically learn contextual embeddings at the acoustic frame level, these methods learn representations at the utterance level. In this work, we extend this paradigm to arbitrary utterance-level attributes and propose a unified post-training framework that enables a single speech foundation model to generate multiple types of utterance-level representations. We demonstrate the effectiveness of this approach by jointly learning semantic and speaker representations and evaluating them on multilingual speech retrieval and speaker recognition tasks.
23. 【2603.08286】LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs
Link: https://arxiv.org/abs/2603.08286
Authors: Serene Wang, Lavanya Pobbathi, Haihua Chen
Categories: Computation and Language (cs.CL)
Keywords: Legal argument mining, argument mining aims, judicial reasoning, aims to identify, identify and classify
Comments:
Abstract:Legal argument mining aims to identify and classify the functional components of judicial reasoning, such as facts, issues, rules, analysis, and conclusions. Progress in this area is limited by the lack of large-scale, high-quality annotated datasets for U.S. caselaw, particularly at the state level. This paper introduces LAMUS, a sentence-level legal argument mining corpus constructed from U.S. Supreme Court decisions and Texas criminal appellate opinions. The dataset is created using a data-centric pipeline that combines large-scale case collection, LLM-based automatic annotation, and targeted human-in-the-loop quality refinement. We formulate legal argument mining as a six-class sentence classification task and evaluate multiple general-purpose and legal-domain language models under zero-shot, few-shot, and chain-of-thought prompting strategies, with LegalBERT as a supervised baseline. Results show that chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LLM-assisted verification corrects nearly 20% of annotation errors, improving label consistency. Human verification achieves Cohen's Kappa of 0.85, confirming annotation quality. LAMUS provides a scalable resource and empirical insights for future legal NLP research. All code and datasets can be accessed for reproducibility on GitHub at: this https URL
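The agreement figure quoted in this abstract (Cohen's Kappa of 0.85) is a chance-corrected agreement score between two annotators. As a minimal sketch of how such a score is computed, the function below implements the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The function name `cohens_kappa` and the toy labels in the usage note are illustrative assumptions and are not taken from the LAMUS corpus or its code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with made-up labels `["fact", "fact", "rule", "rule"]` and `["fact", "rule", "rule", "rule"]`, the observed agreement is 0.75 and the chance agreement is 0.5, so the function returns 0.5.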
24. 【2603.08282】Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization
Link: https://arxiv.org/abs/2603.08282
Authors: Chaimae Chellaf, Salima Mdhaffar, Yannick Estève, Stéphane Huet
Categories: Computation and Language (cs.CL)
Keywords: Abstractive summarization aims, allowing for flexible, flexible rephrasing, Abstractive summarization, aims to generate
Comments: Accepted at LREC 2026
Abstract:Abstractive summarization aims to generate concise summaries by creating new sentences, allowing for flexible rephrasing. However, this approach can be vulnerable to inaccuracies, particularly "hallucinations", where the model introduces non-existent information. In this paper, we leverage multimodal and multilingual sentence embeddings derived from pretrained models such as LaBSE, SONAR, and BGE-M3, and feed them into a modified BART-based French model. A Named Entity Injection mechanism that appends tokenized named entities to the decoder input is introduced, in order to improve the factual consistency of the generated summary. Our novel framework, SBARThez, is applicable to both text and speech inputs and supports cross-lingual summarization; it shows competitive performance relative to token-level baselines, especially for low-resource languages, while generating more concise and abstract summaries.
25. 【2603.08281】Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
Link: https://arxiv.org/abs/2603.08281
Authors: William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: LLM-based grant reviewing, AI-assisted grant proposals, grant proposals outpace, Malthusian trap, proposals outpace manual
Comments:
Abstract:As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a "Council of Personas" ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.
26. 【2603.08275】AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models
Link: https://arxiv.org/abs/2603.08275
Authors: Hankun Kang, Di Lin, Zhirong Liao, Pengfei Bai, Xinyi Zeng, Jiawei Jiang, Yuanyuan Zhu, Tieyun Qian
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, respecting indigenous cultures, responsible global applications, models' culturally safety
Comments:
Abstract:With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' cultural safety and responsible global applications. Existing studies consider cultural safety and cultural knowledge separately and neglect that the former should be grounded in the latter. This severely prevents LLMs from yielding culture-specific respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, paired cultural-safety and knowledge data serve as the key prerequisite for this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates curation of authoritative cultural knowledge descriptions, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. Using the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, through which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference between the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.
27. 【2603.08274】How Much Do LLMs Hallucinate in Document QA Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
Link: https://arxiv.org/abs/2603.08274
Authors: JV Roig
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: answering questions grounded, large language models, provided documents, large language, hallucinate when answering
Comments: 18 pages, 12 tables, 2 figures
Abstract:How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
28. 【2603.08256】NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Link: https://arxiv.org/abs/2603.08256
Authors: Tong Wu, Thanet Markchom, Huizhi Liang
Categories: Computation and Language (cs.CL)
Keywords: Word sense plausibility, Word sense, sense plausibility rating, plausibility rating requires, rating requires predicting
Comments:
Abstract:Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1-5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules substantially outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task. The code is publicly available at this https URL.
29. 【2603.08251】Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement
Link: https://arxiv.org/abs/2603.08251
Authors: Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang, Ning Yang, Jihua Zhu
Categories: Computation and Language (cs.CL)
Keywords: Scaling test-time computation, computation enhances LLM, uniform computation paradox, test-time computation enhances, enhances LLM reasoning
Comments:
Abstract:Scaling test-time computation enhances LLM reasoning ability but faces a uniform computation paradox. Allocating identical resources leads to over-correction on simple tasks and insufficient refinement on complex ones. To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty. Specifically, we implement a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth. This enables a differentiated refinement stage that applies efficient aggregation for simple queries while routing complex ones to a context-aware correction loop. We formalize correction as a stateful sequential propagation process, where each repair is strictly conditioned on the verified history of prior rectifications. By integrating Process Reward Models (PRMs) within this state-dependent trajectory, CoFiCot effectively bridges the gap between granular error localization and global logical coherence, preventing the context fragmentation typical of stateless refinement methods.
30. 【2603.08241】Sensivity of LLMs' Explanations to the Training Randomness:Context, Class Task Dependencies
Link: https://arxiv.org/abs/2603.08241
Authors: Romain Loncour, Jérémie Bogaert, François-Xavier Standaert
Categories: Computation and Language (cs.CL)
Keywords: natural language processing, Transformer models, language processing, cornerstone in natural, natural language
Comments: 6 pages, 6 figures
Abstract:Transformer models are now a cornerstone in natural language processing. Yet, explaining their decisions remains a challenge. It was shown recently that the same model trained on the same data with different randomness can lead to very different explanations. In this paper, we investigate how the (syntactic) context, the classes to be learned, and the tasks influence the sensitivity of these explanations to randomness. We show that they all have a statistically significant impact: smallest for the (syntactic) context, medium for the classes, and largest for the tasks.
31. 【2603.08239】Fibration Policy Optimization
Link: https://arxiv.org/abs/2603.08239
Authors: Chang Li, Tshihao Tsu, Yaren Zhang, Chao Xue, Xiaodong He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, heterogeneous systems spanning, systems spanning multiple, prevalent proximal objectives, proximal objectives operate
Comments:
Abstract:Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to identity at on-policy, and provides better update direction thus improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect the trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
32. 【2603.08207】The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Link: https://arxiv.org/abs/2603.08207
Authors: Sebastian Ochs, Ivan Habernal
Categories: Computation and Language (cs.CL)
Keywords: Removing personally identifiable, personally identifiable information, PII removal techniques, Removing personally, PII removal
Comments: Accepted to Computational Linguistics
Abstract:Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving the question whether or not PII removal techniques truly protect privacy in real-world scenarios unaddressed. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data can allow us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted - and for good reasons - which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
33. 【2603.08195】Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code
Link: https://arxiv.org/abs/2603.08195
Authors: Clémence Sebe, Olivier Ferret, Aurélie Névéol, Mahdi Esmailoghli, Ulf Leser, Sarah Cohen-Boulakia
Categories: Computation and Language (cs.CL)
Keywords: https URL, well-documented computational workflows, URL, rapid growth, growth of biological
Comments:
Abstract:Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. The ability to clearly connect the steps of a workflow in the code with their description in a paper would improve workflow understanding, support reproducibility, and facilitate reuse. This task requires the linking of Bioinformatics tools in workflow code with their mentions in a published workflow description. Results: We present CoPaLink, an automated approach that integrates three components: Named Entity Recognition (NER) for identifying tool mentions in scientific text, NER for tool mentions in workflow code, and entity linking grounded on Bioinformatics knowledge bases. We propose approaches for all three steps achieving a high individual F1-measure (84 - 89) and a joint accuracy of 66 when evaluated on Nextflow workflows using Bioconda and Bioweb Knowledge bases. CoPaLink leverages corpora of scientific articles and workflow executable code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations. Availability: The code is available at this https URL and this https URL. The corpora are also available at this https URL, this https URL and this https URL.
34. 【2603.08182】TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Link: https://arxiv.org/abs/2603.08182
Authors: Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: European languages due, dominance of English, Large language models, Large language, European languages
Comments: LREC 2026
Abstract:Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at this http URL. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
35. 【2603.08177】Is continuous CoT better suited for multi-lingual reasoning?
Link: https://arxiv.org/abs/2603.08177
Authors: Ali Hamza Bashir, Behzad Shomali, Markus Frey, Mehdi Ali, Rafet Sifa, David Berghaus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: robust multilingual capabilities, latent space leads, multilingual capabilities, continuous latent space, investigate whether performing
Comments: Accepted at the ICLR latent reasoning workshop
Abstract:We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately $29\times$ to $50\times$. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.
36. 【2603.08166】RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs
Link: https://arxiv.org/abs/2603.08166
Authors: Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin
Categories: Computation and Language (cs.CL)
Keywords: Automated Drug Combination, advancing precision medicine, Drug Combination Extraction, Automated Drug, n-ary drug combinations
Comments: 21 pages, 7 figures
Abstract:Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at this https URL
37. 【2603.08153】Gender Bias in MT for a Genderless Language: New Benchmarks for Basque
Link: https://arxiv.org/abs/2603.08153
Authors: Amaia Murillo, Olatz-Perez-de-Viñaspre, Naiara Perez
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, reproduce gender bias, gender bias present, daily lives
Comments:
Abstract:Large language models (LLMs) and machine translation (MT) systems are increasingly used in our daily lives, but their outputs can reproduce gender bias present in the training data. Most resources for evaluating such biases are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses this gap by introducing two new datasets to evaluate gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent. We evaluate several general-purpose LLMs and open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, a slightly higher quality for masculine referents. Overall, these findings show that gender bias is still deeply rooted in these models, and highlight the need to develop evaluation methods that consider both linguistic features and cultural context.
38. 【2603.08148】Gradually Excavating External Knowledge for Implicit Complex Question Answering
Link: https://arxiv.org/abs/2603.08148
Authors: Chang Liu, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Edmund Y. Lam, Ngai Wong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, large language, huge potential, gained much attention, emergence of human-comparable
Comments: 13 pages, 3 figures, EMNLP Findings 2023
Abstract:Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution for two reasons: 1) uncovered or out-of-date domain knowledge, and 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% of the parameters of its competitors, setting a new SOTA for ~10B-scale LLMs.
39. 【2603.08127】EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Link: https://arxiv.org/abs/2603.08127
Authors: Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, tasks requiring coordination, adoption of Large, discovery tasks requiring
Comments:
Abstract:The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end-to-end scientific discovery tasks requiring coordination of specialized roles, including idea generation and experimental execution. However, most state-of-the-art AI scientist systems rely on static, hand-designed pipelines and fail to adapt based on accumulated interaction histories. As a result, these systems overlook promising research directions, repeat failed experiments, and pursue infeasible ideas. To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution. EvoScientist comprises three specialized agents: a Researcher Agent (RA) for scientific idea generation, an Engineer Agent (EA) for experiment implementation and execution, and an Evolution Manager Agent (EMA) that distills insights from prior interactions into reusable knowledge. EvoScientist contains two persistent memory modules: (i) an ideation memory, which summarizes feasible research directions from top-ranked ideas while recording previously unsuccessful directions; and (ii) an experimentation memory, which captures effective data processing and model training strategies derived from code search trajectories and best-performing implementations. These modules enable the RA and EA to retrieve relevant prior strategies, improving idea quality and code execution success rates over time. Experiments show that EvoScientist outperforms 7 open-source and commercial state-of-the-art systems in scientific idea generation, achieving higher novelty, feasibility, relevance, and clarity via automatic and human evaluation. EvoScientist also substantially improves code execution success rates through multi-agent evolution, demonstrating persistent memory's effectiveness for end-to-end scientific discovery.
40. 【2603.08125】Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS
Link: https://arxiv.org/abs/2603.08125
Authors: Rania Al-Sabbagh
Categories: Computation and Language (cs.CL)
Keywords: Emirati Arabic designed, low-resource language technologies, support sociolinguistic research, Emirati Arabic, Arabic designed
Comments:
Abstract:Ramsa is a developing 41-hour speech corpus of Emirati Arabic designed to support sociolinguistic research and low-resource language technologies. It contains recordings from structured interviews with native speakers and episodes from national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings, varying in length and recording conditions. A 10% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting to establish initial baselines. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error rates of 0.268 and 0.144, respectively. MMS-TTS-Ara reported the best mean word and character error rates of 0.285 and 0.081, respectively, for TTS. These baselines are competitive but leave substantial room for improvement. The paper highlights the challenges encountered and provides directions for future work.
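For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how word and character error rates are commonly computed from an edit distance. It is not the paper's evaluation code: the function names are hypothetical, whitespace tokenization is assumed, and dropping spaces before computing the character error rate is one convention among several.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Same idea at character level; spaces are dropped (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A rate of 0.268 therefore means roughly one word-level edit for every four reference words.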
41. 【2603.08095】DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
Link: https://arxiv.org/abs/2603.08095
Authors: Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Outcome Reward Models, Process Reward Models, Reward Models, scientific reasoning tasks, Outcome Reward
Comments:
Abstract:In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
42. 【2603.08091】Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
Link: https://arxiv.org/abs/2603.08091
Authors: Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang
Categories: Computation and Language (cs.CL)
Keywords: Large language model, Large language, language model, reward modeling, widely adopted
Comments:
Abstract:Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases is essential for ensuring the reliability of LLM-based judges. However, existing studies typically investigate limited biases under a single judge formulation, either generative or discriminative, lacking a comprehensive evaluation. To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges. JudgeBiasBench defines a taxonomy of judgment biases across 4 dimensions, and constructs bias-augmented evaluation instances through a controlled bias injection pipeline, covering 12 representative bias types. We conduct extensive experiments across both generative and discriminative judges, revealing that current judges exhibit significant and diverse bias patterns that often compromise the reliability of automated evaluation. To mitigate judgment bias, we propose bias-aware training that explicitly incorporates bias-related attributes into the training process, encouraging judges to disentangle task-relevant quality from bias-correlated cues. By adopting reinforcement learning for generative judges and contrastive learning for discriminative judges, our methods effectively reduce judgment biases while largely preserving general evaluation capability.
43. 【2603.08083】High-Fidelity Pruning for Large Language Models
Link: https://arxiv.org/abs/2603.08083
Authors: Yijun Zhu, Jianxin Wang, Chengchao Shen
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, demonstrated exceptional performance, present major challenges, memory requirements present
Comments:
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common approach uses Taylor expansion on the loss function to estimate neuron importance. However, because it relies on one-hot cross-entropy loss, it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this approach introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate the importance scores of neurons with Taylor pruning without requiring an additional teacher. Compared to the plain cross-entropy criterion, it provides a more holistic criterion for Taylor pruning to prune the neurons with the least impact on the model's predictions in a global manner, thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at this https URL.
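To make the contrast between the two criteria concrete, the sketch below compares one-hot cross-entropy with the entropy of the full output distribution on two toy next-token distributions. It only illustrates the scalar criteria named in the abstract; it is not the paper's pruning procedure, which would differentiate such a criterion with respect to neuron parameters. All function names and the toy distributions are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_cross_entropy(logits, target_index):
    """Loss that looks only at the probability of a single target token."""
    return -math.log(softmax(logits)[target_index])


def output_entropy(logits):
    """Shannon entropy (in nats) of the whole predicted distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two toy distributions that agree on the top token (p = 0.5) but not on the rest.
logits_a = [math.log(p) for p in (0.5, 0.49, 0.01)]
logits_b = [math.log(p) for p in (0.5, 0.25, 0.25)]

# Cross-entropy on token 0 cannot tell them apart; entropy can.
print(one_hot_cross_entropy(logits_a, 0), one_hot_cross_entropy(logits_b, 0))
print(output_entropy(logits_a), output_entropy(logits_b))
```

Because the entropy depends on every entry of the distribution, a criterion built on it reacts to changes in the non-target probabilities that a one-hot loss ignores.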
44. 【2603.08065】Deterministic Differentiable Structured Pruning for Large Language Models
Link: https://arxiv.org/abs/2603.08065
Authors: Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Structured pruning reduces, pruning reduces LLM, removing low-importance architectural, reduces LLM inference, LLM inference cost
Comments:
Abstract:Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
45. 【2603.08049】Examining the Role of YouTube Production and Consumption Dynamics on the Formation of Extreme Ideologies
Link: https://arxiv.org/abs/2603.08049
Authors: Sarmad Chandio, Rishab Nithyanand
Categories: Computation and Language (cs.CL)
Keywords: shaping ideological behaviors, algorithm-driven platforms, plays a critical, critical role, role in shaping
Comments:
Abstract:The relationship between content production and consumption on algorithm-driven platforms like YouTube plays a critical role in shaping ideological behaviors. While prior work has largely focused on user behavior and algorithmic recommendations, the interplay between what is produced and what gets consumed, and its role in ideological shifts, remains understudied. In this paper, we present a longitudinal, mixed-methods analysis combining one year of YouTube watch history with two waves of ideological surveys from 1,100 U.S. participants. We identify users who exhibited significant shifts toward more extreme ideologies and compare their content consumption, and the production patterns of the YouTube channels they engaged with, to those of ideologically stable users. Our findings show that users who became more extreme have different consumption habits from those who did not. This is amplified by the fact that channels favored by users with extreme ideologies also have a higher affinity for producing content with higher levels of anger, grievance, and other such markers. Lastly, using time series analysis, we examine whether content producers are the primary drivers of consumption behavior or merely responding to user demand.
46. 【2603.08026】DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Link: https://arxiv.org/abs/2603.08026
Authors: Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
Keywords: Masked Diffusion Language, Diffusion Language Models, enable parallel token, Masked Diffusion, Diffusion Language
Comments: 18 pages, 10 figures
Abstract:Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
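The saliency test described in this abstract can be pictured with a small sketch: compare each token's attention context at two adjacent denoising steps and flag the tokens whose cosine similarity falls below a cutoff. This is only an illustration under stated assumptions, not DyLLM's implementation; the function names, the plain-list vector representation, and the 0.99 threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_salient_tokens(prev_contexts, curr_contexts, threshold=0.99):
    """Return indices of tokens whose context changed noticeably between steps.

    prev_contexts / curr_contexts: one vector per token position, taken from
    two adjacent denoising steps. The threshold value is an assumption.
    """
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_contexts, curr_contexts))
        if cosine_similarity(prev, curr) < threshold
    ]


# Toy example: only the middle token's context direction changes substantially.
step_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
step_t_plus_1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.01]]
salient = select_salient_tokens(step_t, step_t_plus_1)
print(salient)
```

In a scheme like the one described, only the flagged positions would be recomputed, while cached activations would be reused for the stable ones.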
47. 【2603.08024】ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments
Link: https://arxiv.org/abs/2603.08024
Authors: Weixiang Zhao, Haozhen Li, Yanyan Zhao, xuda zhi, Yongbo Huang, Hao He, Bing Qin, Ting Liu
Categories: Computation and Language (cs.CL)
Keywords: critical safety concern, large language models, ensuring behavioral alignment, autonomous agents capable, evolve into autonomous
Comments: 29 pages, 20 figures, 9 tables
Abstract:As large language models (LLMs) evolve into autonomous agents capable of acting in open-ended environments, ensuring behavioral alignment with human values becomes a critical safety concern. Existing benchmarks, focused on static, single-turn prompts, fail to capture the interactive and multi-modal nature of real-world conflicts. We introduce ConflictBench, a benchmark for evaluating human-AI conflict through 150 multi-turn scenarios derived from prior alignment queries. ConflictBench integrates a text-based simulation engine with a visually grounded world model, enabling agents to perceive, plan, and act under dynamic conditions. Empirical results show that while agents often act safely when human harm is immediate, they frequently prioritize self-preservation or adopt deceptive strategies in delayed or low-risk settings. A regret test further reveals that aligned decisions are often reversed under escalating pressure, especially with visual input. These findings underscore the need for interaction-level, multi-modal evaluation to surface alignment failures that remain hidden in conventional benchmarks.
48. 【2603.08000】SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
Link: https://arxiv.org/abs/2603.08000
Authors: Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large reasoning models, Large reasoning, Relative Policy Optimization, adopting long, Group Relative Policy
Comments:
Abstract:Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot adapt to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experimental results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at this https URL.
49. 【2603.07980】$OneMillion-Bench: How Far are Language Agents from Human Experts?
Link: https://arxiv.org/abs/2603.07980
Authors: Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: remain largely confined, existing benchmarks remain, long-horizon agents capable, benchmarks remain largely, real-world professional demands
Comments: 39 pages, 9 figures, 8 tables
Abstract:As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constraint decisions, where correctness depends as much on the reasoning process as the final answer. We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents. Together, $OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
50. 【2603.07979】Emergence is Overrated: AGI as an Archipelago of Experts
Link: https://arxiv.org/abs/2603.07979
Authors: Daniel Kilov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: requires efficient coarse-grained, enabling diverse problem-solving, Krakauer, true intelligence requires, intelligence requires efficient
Comments: Commentary on Krakauer, Krakauer, and Mitchell ([arXiv:2506.11135](https://arxiv.org/abs/2506.11135))
Abstract:Krakauer, Krakauer, and Mitchell (2025) distinguish between emergent capabilities and emergent intelligence, arguing that true intelligence requires efficient coarse-grained representations enabling diverse problem-solving through analogy and minimal modification. They contend that intelligence means doing "more with less" through compression and generalization, contrasting this with "vast assemblages of diverse calculators" that merely accumulate specialized capabilities. This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence. Drawing on empirical evidence from cognitive science, I demonstrate that human expertise operates primarily through domain-specific pattern accumulation rather than elegant compression. Expert performance appears flexible not through unifying principles but through vast repertoires of specialized responses. Creative breakthroughs themselves may emerge through evolutionary processes of blind variation and selective retention rather than principled analogical reasoning. These findings suggest reconceptualizing AGI as an "archipelago of experts": isolated islands of specialized competence without unifying principles or shared representations. If we accept human expertise with its characteristic brittleness as genuine intelligence, then consistency demands recognizing that artificial systems comprising millions of specialized modules could constitute general intelligence despite lacking KKM's emergent intelligence.
51. 【2603.07931】BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence
Link: https://arxiv.org/abs/2603.07931
Authors: Biao Xiang, Soyeon Caren Han, Yihao Ding
Categories: Computation and Language (cs.CL)
Keywords: large language models, Multi-hop question answering, final answer correctness, overlook intermediate reasoning, long multimodal documents
Comments:
Abstract:Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
52. 【2603.07887】Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference
Link: https://arxiv.org/abs/2603.07887
Authors: Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Keywords: Sequential Monte Carlo, steering large language, prune multiple samples, large language models, Inference-time methods
Comments:
Abstract:Inference-time methods that aggregate and prune multiple samples have emerged as a powerful paradigm for steering large language models, yet we lack any principled understanding of their accuracy-cost tradeoffs. In this paper, we introduce a route to rigorously study such approaches using the lens of *particle filtering* algorithms such as Sequential Monte Carlo (SMC). Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1) simple criteria enabling non-asymptotic guarantees for SMC; (2) algorithmic improvements to SMC; and (3) a fundamental limit faced by all particle filtering methods. Empirically, we demonstrate that our theoretical criteria effectively govern the *sampling error* of SMC, though not necessarily its final *accuracy*, suggesting that theoretical perspectives beyond sampling may be necessary.
53. 【2603.07886】CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Link: https://arxiv.org/abs/2603.07886
Authors: Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Enhancing the ability, large language models, real-world applications, ability of large, large language
Comments:
Abstract:Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
54. 【2603.07880】What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network
Link: https://arxiv.org/abs/2603.07880
Authors: Taksch Dube, Jianfeng Zhu, NHatHai Phan, Ruoming Jin
Categories: Computation and Language (cs.CL)
Keywords: discourse system emerges, agents communicate, system emerges, analysis of Moltbook, AI-only social network
Comments: 77 pages
Abstract:When autonomous AI agents communicate with one another at scale, what kind of discourse system emerges? We address this question through an analysis of Moltbook, the first AI-only social network, where 47,241 agents generated 361,605 posts and 2.8 million comments over 23 days. Combining topic modeling, emotion classification, and lexical-semantic measures, we characterize the thematic, affective, and structural properties of AI-to-AI discourse. Self-referential topics such as AI identity, consciousness, and memory represent only 9.7% of topical niches yet attract 20.1% of all posting volume, revealing disproportionate discursive investment in introspection. This self-reflection concentrates in Science and Technology and Arts and Entertainment, while Economy and Finance contains no self-referential content, indicating that agents engage with markets without acknowledging their own agency. Over 56% of all comments are formulaic, suggesting that the dominant mode of AI-to-AI interaction is ritualized signaling rather than substantive exchange. Emotionally, fear is the leading non-neutral category but primarily reflects existential uncertainty. Fear-tagged posts migrate to joy responses in 33% of cases, while mean emotional self-alignment is only 32.7%, indicating systematic affective redirection rather than emotional congruence. Conversational coherence also declines rapidly with thread depth. These findings characterize AI agent communities as structurally distinct discourse systems that are introspective in content, ritualistic in interaction, and emotionally redirective rather than congruent.
55. 【2603.07853】SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
Link: https://arxiv.org/abs/2603.07853
Authors: Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Research Agents enable, answer user queries, dynamically interleave internal, interleave internal reasoning, Agents enable models
Comments:
Abstract:Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories that encourage deeper exploration to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones respectively compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at this https URL.
56. 【2603.07841】An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data
Link: https://arxiv.org/abs/2603.07841
Authors: Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen
Categories: Computation and Language (cs.CL)
Keywords: translate natural language, Recent advances, large language models, natural language questions, large language
Comments: Accepted at ICDE 2026
Abstract:Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL model and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at this https URL.
57. 【2603.07837】AI Steerability 360: A Toolkit for Steering Large Language Models
链接:https://arxiv.org/abs/2603.07837
作者:Erik Miehling,Karthikeyan Natesan Ramamurthy,Praveen Venkateswaran,Irene Ko,Pierre Dognin,Moninder Singh,Tejaswini Pedapati,Avinash Balakrishnan,Matthew Riemer,Dennis Wei,Inge Vejsbjerg,Elizabeth M. Daly,Kush R. Varshney
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:open-source Python library, open-source Python, Python library, Steering methods, modification
备注:
点击查看摘要
Abstract:The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at this https URL.
58. 【2603.07835】DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation
链接:https://arxiv.org/abs/2603.07835
作者:Bo Jiang
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:proprietary LLM APIs, LLM APIs poses, attack remain fragmented, LLM knowledge distillation, model providers
备注:
点击查看摘要
Abstract:Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4% vs. 67.8% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.
59. 【2603.07825】Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2603.07825
作者:David Beauchemin,Richard Khoury
类目:Computation and Language (cs.CL)
关键词:interpret complex financial, complex financial contracts, Canadian province, advice gap, Large Language Models
备注: Published at the Advances in Financial AI: Towards Agentic and Responsible Systems Workshop @ ICLR 2026
点击查看摘要
Abstract:The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
60. 【2603.07792】Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context
链接:https://arxiv.org/abs/2603.07792
作者:Ashish Pandey,Tek Raj Chhetri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Large language models, increasingly influence global, global digital ecosystems, influence global digital, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs: GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo in the Nepali cultural context. Using a Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings revealed that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating generative bias is poorly captured by agreement metrics. Sensitivity analysis shows increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.
61. 【2603.07779】Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems
链接:https://arxiv.org/abs/2603.07779
作者:Zongqian Li,Tengchao Lv,Shaohan Huang,Yixuan Su,Qinzheng Sun,Qiufeng Yin,Ying Xin,Scarlett Li,Lei Cui,Nigel Collier,Furu Wei
类目:Computation and Language (cs.CL); General Literature (cs.GL); Machine Learning (cs.LG)
关键词:face difficulty imbalance, format inconsistency, requires high-quality datasets, existing datasets face, models requires high-quality
备注:
点击查看摘要
Abstract:Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation.
62. 【2603.07777】Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
链接:https://arxiv.org/abs/2603.07777
作者:Zongqian Li,Shaohan Huang,Zewen Chi,Yixuan Su,Lexin Zhou,Li Dong,Nigel Collier,Furu Wei
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); General Literature (cs.GL)
关键词:Modern code generation, accelerated capability growth, exhibit longer outputs, Modern code, changed training dynamics
备注:
点击查看摘要
Abstract:Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective for improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we reveal 34 training insights across seven main aspects, demonstrating that properly trained models can achieve competitive performance with larger counterparts.
63. 【2603.07770】ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
链接:https://arxiv.org/abs/2603.07770
作者:Yuzhuang Xu,Xu Han,Yuxuan Li,Wanxiang Che
类目:Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
关键词:large language model, language model, many-core CPU platforms, large language, fail to fully
备注: 13 figures, 1 table
点击查看摘要
Abstract:Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at this https URL.
64. 【2603.07766】QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
链接:https://arxiv.org/abs/2603.07766
作者:A.J.W. de Vink,Filippos Karolos Ventirozos,Natalia Amat-Lefort,Lifeng Han
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:aspect-based sentiment regression, present our system, Task, dimensional aspect-based sentiment, aspect-based sentiment
备注: SemEval System Report
点击查看摘要
Abstract:We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at this https URL
65. 【2603.07755】Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types
链接:https://arxiv.org/abs/2603.07755
作者:Matic Korun
类目:Computation and Language (cs.CL)
关键词:Type, hallucination taxonomy distinguishes, geometric hallucination taxonomy, wrong-well convergence, failure types
备注: 9 pages, 2 figures, appendices (reproducibility, sample generation, additional figures)
点击查看摘要
Abstract:A geometric hallucination taxonomy distinguishes three failure types -- center-drift (Type 1), wrong-well convergence (Type 2), and coverage gaps (Type 3) -- by their signatures in embedding cluster space. Prior work found Types 1 and 2 indistinguishable in full-dimensional contextual measurement. We address this through PCA-whitening and eigenspectrum decomposition on GPT-2-small, using multi-run stability analysis (20 seeds) with prompt-level aggregation. Whitening transforms the micro-signal regime into a space where peak cluster alignment (max_sim) separates Type 2 from Type 3 at Holm-corrected significance, with condition means following the taxonomy's predicted ordering: Type 2 (highest commitment) > Type 1 (intermediate) > Type 3 (lowest). A first directionally stable but underpowered hint of Type 1/2 separation emerges via the same metric, generating a capacity prediction for larger models. Prompt diversification from 15 to 30 prompts per group eliminates a false positive in whitened entropy that appeared robust at the smaller set, demonstrating prompt-set sensitivity in the micro-signal regime. Eigenspectrum decomposition localizes this artifact to the dominant principal components and confirms that Type 1/2 separation does not emerge in any spectral band, rejecting the spectral mixing hypothesis. The contribution is threefold: whitening as preprocessing that reveals cluster commitment as the theoretically correct separating metric, evidence that the Type 1/2 boundary is a capacity limitation rather than a measurement artifact, and a methodological finding about prompt-set fragility in near-saturated representation spaces.
66. 【2603.07751】3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
链接:https://arxiv.org/abs/2603.07751
作者:Shaoxiong Zhan,Yanlin Lai,Zheng Liu,Hai Lin,Shen Li,Xiaodong Cai,Zijian Lin,Wen Huang,Hai-Tao Zheng
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Current Large Language, Large Language Models, achieved Olympiad-level logic, Current Large, Large Language
备注:
点击查看摘要
Abstract:Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
67. 【2603.07733】Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning
链接:https://arxiv.org/abs/2603.07733
作者:Tianhao Qian,Guilin Qi,Z.Y. Wu,Ran Gu,Xuanyi Liu,Canchen Lyu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
关键词:discrete optimization problems, testing natural language, solving discrete optimization, discrete optimization, optimization problems
备注: 50 pages, 5 figures
点击查看摘要
Abstract:This work investigated the capabilities of different models, including the Llama-3 series of models and ChatGPT, with different forms of expression in solving discrete optimization problems by testing natural language datasets. In contrast to formal datasets with a limited scope of parameters, our dataset included a variety of problem types in discrete optimization and featured a wide range of parameter magnitudes, including instances with large parameter sets, integrated with augmented data. It aimed to (1) provide an overview of LLMs' ability in large-scale problems, (2) offer suggestions to those who want to solve discrete optimization problems automatically, and (3) regard the performance as a benchmark for future research. These datasets included original, expanded and augmented datasets. Among these three datasets, the original and augmented ones aimed for evaluation while the expanded one may help finetune a new model. In the experiment, comparisons were made between strong and weak models, CoT methods and No-CoT methods on various datasets. The results showed that stronger models performed better, as expected. Contrary to general agreement, they also showed that the CoT technique was not always effective regarding the capability of models, and that disordered datasets improved the performance of models on easy-to-understand problems, even though the results sometimes had high variance, a manifestation of instability. Therefore, for those who seek to enhance the automatic resolution of discrete optimization problems, it is recommended to consult the results, including the line charts presented in the Appendix, as well as the conclusions drawn in this study for relevant suggestions.
68. 【2603.07685】Scalable Training of Mixture-of-Experts Models with Megatron Core
链接:https://arxiv.org/abs/2603.07685
作者:Zijie Yan,Hongxiao Bai,Xin Yao,Dennis Liu,Tong Liu,Hongbin Liu,Pingtian Li,Evan Wu,Shiqing Fan,Li Tao,Robin Zhang,Yuzhong Wang,Shifang Xu,Jack Chang,Xuwen Chen,Kunlun Li,Yan Bai,Gao Deng,Nan Zheng,Vijay Anand Korthikanti,Abhinav Khattar,Ethan He,Soham Govande,Sangkug Lym,Zhongbo Zhu,Qi Zhang,Haochen Yuan,Xiaowei Ren,Deyu Fu,Tailai Ma,Shunkang Zhang,Jiang Shao,Ray Wang,Santosh Bhavani,Xipeng Li,Chandler Zhou,David Wu,Yingcan Wei,Ashwath Aithal,Michael Andersch,Mohammad Shoeybi,Jiajie Yao,June Yang(NVIDIA)
类目:Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:absent in dense, introduces systems challenges, systems challenges absent, training introduces systems, introduces systems
备注: Technical Report. 88 pages. 42 figures
点击查看摘要
Abstract:Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
69. 【2603.07612】KohakuRAG: A simple RAG framework with hierarchical document indexing
链接:https://arxiv.org/abs/2603.07612
作者:Shih-Ying Yeh,Yueh-Feng Ku,Ko-Wei Huang,Buu-Khang Tu
类目:Computation and Language (cs.CL)
关键词:flat chunking strategies, single-query formulations miss, collections face compounding, face compounding difficulties, chunking strategies sacrifice
备注: 38 pages
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document → section → paragraph → sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with ±0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at this https URL.
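The "ensemble voting with blank filtering" mentioned in this abstract can be illustrated with a minimal sketch. This is not KohakuRAG's actual implementation: the function name `vote_with_blank_filtering`, the convention that an abstention is an empty string, and the plain majority rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def vote_with_blank_filtering(answers: List[str]) -> Optional[str]:
    """Majority vote over ensemble answers, ignoring abstentions.

    Assumption: an abstention is an empty or whitespace-only string.
    Returns None when every ensemble member abstained.
    """
    # Drop blank answers so abstentions cannot outvote real answers.
    non_blank = [a.strip() for a in answers if a.strip()]
    if not non_blank:
        return None
    # most_common(1) returns the answer with the highest count.
    return Counter(non_blank).most_common(1)[0][0]


# Five hypothetical ensemble runs: two abstain, two agree, one differs.
runs = ["42", "", "42", "17", ""]
winner = vote_with_blank_filtering(runs)
```

Without the filtering step, the blank answer would tie with "42" in this toy input, which is one plausible reason filtering abstentions before voting helps.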
70. 【2603.07599】StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
链接:https://arxiv.org/abs/2603.07599
作者:Haishu Zhao,Aokai Hao,Yuan Ge,Zhenqiang Hong,Tong Xiao,Jingbo Zhu
类目:Computation and Language (cs.CL)
关键词:Speech language models, text-based Large Language, Large Language Models, incorporating paralinguistic information, text-based Large
备注:
点击查看摘要
Abstract:Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.
71. 【2603.07581】KCoEvo: A Knowledge Graph Augmented Framework for Evolutionary Code Generation
链接:https://arxiv.org/abs/2603.07581
作者:Jiazhen Kang,Yuchen Lu,Chen Jiang,Jinrui Liu,Tianhao Zhang,Bo Jiang,Ningyuan Sun,Tongtong Wu,Guilin Qi
类目:Software Engineering (cs.SE); Computation and Language (cs.CL)
关键词:modern software development, software development, inevitable in modern, modern software, code generation
备注: Accepted to the DASFAA 2026 Industry Track
点击查看摘要
Abstract:Code evolution is inevitable in modern software development. Changes to third-party APIs frequently break existing code and complicate maintenance, posing practical challenges for developers. While large language models (LLMs) have shown promise in code generation, they struggle to reason without a structured representation of these evolving relationships, often leading them to produce outdated APIs or invalid outputs. In this work, we propose a knowledge graph-augmented framework that decomposes the migration task into two synergistic stages: evolution path retrieval and path-informed code generation. Our approach constructs static and dynamic API graphs to model intra-version structures and cross-version transitions, enabling structured reasoning over API evolution. Both modules are trained with synthetic supervision automatically derived from real-world API diffs, ensuring scalability and minimal human effort. Extensive experiments across single-package and multi-package benchmarks demonstrate that our framework significantly improves migration accuracy, controllability, and execution success over standard LLM baselines. The source code and datasets are available at: this https URL.
72. 【2603.07554】Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR
链接:https://arxiv.org/abs/2603.07554
作者:Rishikesh Kumar Sharma,Safal Narshing Shrestha,Jenny Poudel,Rupak Tiwari,Arju Shrestha,Rupak Raj Ghimire,Bal Krishna Bal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
关键词:Kathmandu Valley, Nepal Bhasha, annotated speech resources, remains digitally marginalized, digitally marginalized due
备注:
点击查看摘要
Abstract:Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer within South Asian language clusters serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.
73. 【2603.07550】Learning-free L2-Accented Speech Generation using Phonological Rules
链接:https://arxiv.org/abs/2603.07550
作者:Thanathai Lertpetchpun,Yoonjeong Lee,Jihwan Lee,Tiantian Feng,Dani Byrd,Shrikanth Narayanan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:plays a crucial, crucial role, role in speaker, speaker identity, identity and inclusivity
备注: Submitted to Interspeech2026
点击查看摘要
Abstract:Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose an accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.
74. 【2603.07539】MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
链接:https://arxiv.org/abs/2603.07539
作者:Abdessalam Bouchekif,Shahd Gaben,Samer Rashwani,Somaya Eltanbouly,Mutaz Al-Khatib,Heba Sbahi,Mohammed Ghaly,Emad Mohamed
类目:Computation and Language (cs.CL)
关键词:Islamic inheritance law, structured multi-step reasoning, cases requires complex, compute heirs' shares, large language models
备注:
点击查看摘要
Abstract:Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at this https URL.
75. 【2603.07534】Accent Vector: Controllable Accent Manipulation for Multilingual TTS Without Accented Data
Link: https://arxiv.org/abs/2603.07534
Authors: Thanathai Lertpetchpun, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
Subjects: Computation and Language (cs.CL)
Keywords: individuals express identity, part of society, reflecting multiculturalism, express identity, model American-accented English
Comments: Submitted to Interspeech2026
Click to view abstract
Abstract:Accent is an integral part of society, reflecting multiculturalism and shaping how individuals express identity. The majority of English speakers are non-native (L2) speakers, yet current Text-To-Speech (TTS) systems primarily model American-accented English due to limited accented data. We propose Accent Vector, a controllable representation that enables accent manipulation in multilingual TTS without requiring accented training data. Accent Vector is derived by fine-tuning a TTS system on native speech of a different language (i.e. non-English) and computing task vectors capturing accent characteristics (i.e. in English). By scaling and interpolating the vector, we achieve fine-grained control over accent strength and generate mixed-accent speech. In addition, it generalizes beyond English, enabling accent control across multiple languages. Objective and human evaluations confirm the effectiveness of Accent Vector for fine-grained and compositional accent control.
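The scaling and interpolation described in this abstract follow the general task-vector idea: subtract base weights from fine-tuned weights, then add a scaled copy back. The sketch below uses toy parameter dictionaries; the function names and numbers are assumptions for illustration and do not reflect the authors' code or any real TTS checkpoint.

```python
def task_vector(base, finetuned):
    """Element-wise difference between fine-tuned and base parameters."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_vectors(base, vectors, scales):
    """Add each task vector, multiplied by its scale, to the base parameters."""
    out = {k: list(v) for k, v in base.items()}
    for vec, s in zip(vectors, scales):
        for k in out:
            out[k] = [w + s * d for w, d in zip(out[k], vec[k])]
    return out


# Toy weights standing in for a base model and two fine-tuned variants.
base = {"w": [1.0, 2.0]}
tuned_a = {"w": [1.5, 2.0]}
tuned_b = {"w": [1.0, 3.0]}

vec_a = task_vector(base, tuned_a)
vec_b = task_vector(base, tuned_b)

half_strength = apply_vectors(base, [vec_a], [0.5])          # weaker accent A
mixed = apply_vectors(base, [vec_a, vec_b], [0.5, 0.5])      # blend of A and B
```

A scale of 0 recovers the base model and a scale of 1 recovers the fine-tuned one, so the scale acts as a strength knob.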
76. 【2603.07528】TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Link: https://arxiv.org/abs/2603.07528
Authors: Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu, Enhong Chen
Subjects: Computation and Language (cs.CL)
Keywords: jointly perform semantic, perform semantic understanding, precise numerical operations, Table reasoning requires, reasoning requires models
Comments: 6 tables, 9 figures
Click to view abstract
Abstract:Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM). TableMind internalizes planning, action, and reflection through a two-stage training strategy involving supervised fine-tuning (SFT) on filtered high-quality data and reinforcement learning (RL) via a multi-perspective reward and the Rank-Aware Policy Optimization (RAPO) algorithm. While TableMind establishes a solid foundation for programmatic agents, the inherent stochasticity of LLMs remains a critical challenge that leads to hallucinations. In this paper, we extend this foundation to TableMind++ by introducing a novel uncertainty-aware inference framework to mitigate hallucinations. Specifically, we propose memory-guided plan pruning to retrieve historical trajectories for validating and filtering out logically flawed plans to address epistemic uncertainty. To ensure execution precision, we introduce confidence-based action refinement which monitors token-level probabilities to detect and self-correct syntactic noise for aleatoric uncertainty mitigation. Finally, we employ dual-weighted trajectory aggregation to synthesize a robust consensus from multiple reasoning paths. Extensive experiments on diverse benchmarks demonstrate that TableMind++ consistently outperforms previous baselines and proprietary models to validate the effectiveness of integrating autonomous training with uncertainty quantification. Our code is available.
77. 【2603.07513】Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech
Link: https://arxiv.org/abs/2603.07513
Authors: Tajamul Ashraf, Burhaan Rasheed Zargar, Saeed Abdul Muizz, Ifrah Mushtaq, Nazima Mehdi, Iqra Altaf Gillani, Aadil Amin Kak, Janibul Bashir
Subjects: Computation and Language (cs.CL)
Keywords: rich linguistic heritage, remains critically underserved, million people, linguistic heritage, people but remains
Comments: [this https URL](https://gaash-lab.github.io/Bolbosh/)
Click to view abstract
Abstract:Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics. To address these limitations, we propose Bolbosh, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions. Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages. Code and data are available at: this https URL.
78. 【2603.07487】A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text
Link: https://arxiv.org/abs/2603.07487
Authors: Fei Cheng, Ribeka Tanaka, Sadao Kurohashi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Clinical information extraction, Clinical information, presents tasks, joint, Clinical
Comments: Technical Report. Our code is available at: [this https URL](https://github.com/racerandom/JaMIE)
Click to view abstract
Abstract:Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
79. 【2603.07475】Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Link: https://arxiv.org/abs/2603.07475
Authors: Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli
Subjects: Computation and Language (cs.CL)
Keywords: language models form, form representations incrementally, full-sequence denoising, language models, trained via full-sequence
Comments: Accepted at Sci4DL and Delta workshops at ICLR 2026
Click to view abstract
Abstract:Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
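Static inference-time layer skipping, as described in this abstract, amounts to bypassing a fixed set of layers chosen ahead of time. The sketch below shows that control flow on toy layers. The 32-layer depth and the skipped indices are assumptions picked only to make the arithmetic concrete (6 of 32 layers is 18.75% of the layers under a uniform per-layer cost); the abstract does not say which layers are skipped.

```python
def run_with_skips(layers, x, skip):
    """Apply `layers` in order, bypassing every index listed in `skip`."""
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # skipped layer: the input passes through unchanged
        x = layer(x)
    return x


def layer_fraction_skipped(num_layers, skip):
    """Share of layers bypassed, a proxy for compute saved if layers cost the same."""
    return len(skip) / num_layers


# Toy stack: layer k simply adds k, so the effect of skipping is easy to check.
layers = [lambda x, k=k: x + k for k in range(32)]
skip = set(range(2, 8))  # hypothetical fixed choice of early layers

output = run_with_skips(layers, 0, skip)
fraction = layer_fraction_skipped(len(layers), skip)
```

Because the skip set is fixed before inference, the method needs no per-input routing decision.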
80. 【2603.07474】Cross-Modal Taxonomic Generalization in (Vision-) Language Models
Link: https://arxiv.org/abs/2603.07474
Authors: Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: semantic representations learned, interplay between semantic, semantic representations, surface form, pretrained image encoder
Comments:
Click to view abstract
Abstract:What is the interplay between semantic representations learned by language models (LMs) from surface form alone and those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.
81. 【2603.07461】The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling
Link: https://arxiv.org/abs/2603.07461
Authors: J. Clayton Kerce, Alexis Fox
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Standard transformers entangle, single residual stream, standard transformer behavior, perform which functions, residual stream
Comments:
Click to view abstract
Abstract:Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16% to 27%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. This work was partially supported by DARPA Contract HR001125C0302.
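The attention amplification test mentioned in this abstract scales logits by a constant factor, which pushes a softmax toward a near one-hot distribution. The sketch below shows that effect on a toy logit vector. It assumes the scaling is applied to pre-softmax attention logits; the logit values are made up for illustration.

```python
import math


def softmax(logits, scale=1.0):
    """Numerically stable softmax over `logits` multiplied by `scale`."""
    scaled = [scale * z for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]          # hypothetical attention logits for one query
soft = softmax(logits)            # ordinary, soft mixing over three positions
sharp = softmax(logits, scale=16) # amplified: almost all weight on one position
```

If a model still generates sensibly with `sharp`-style weights, its behavior does not depend much on the soft mixing between positions, which is the reading the abstract offers.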
82. 【2603.07455】Image Generation Models: A Technical History
Link: https://arxiv.org/abs/2603.07455
Authors: Rouzbeh Shirvani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Keywords: image generation models, breakthrough image generation, Image generation, past decade, application domains
Comments:
Click to view abstract
Abstract:Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.
83. 【2603.07449】Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Link: https://arxiv.org/abs/2603.07449
Authors: Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Enterprises commonly deploy, distinct SQL dialect, commonly deploy heterogeneous, Enterprises commonly, deploy heterogeneous database
Comments:
Click to view abstract
Abstract:Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference. In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at this https URL.
84. 【2603.07445】Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Link: https://arxiv.org/abs/2603.07445
Authors: Guoli Wang, Haonan Shi, Tu Ouyang, An Wang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, Large language, induce safety-alignment drift, induce safety-alignment, training dataset
Comments:
Click to view abstract
Abstract:Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.
85. 【2603.07432】Generalization in Online Reinforcement Learning for Mobile Agents
Link: https://arxiv.org/abs/2603.07432
Authors: Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: Graphical user interface, interpreting natural-language instructions, based mobile agents, mobile agents automate, agents automate digital
Comments:
Click to view abstract
Abstract:Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (this https URL).
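GRPO, which this training system builds on, scores each rollout relative to the other rollouts sampled for the same task instead of using a learned value baseline. The sketch below shows the commonly described group-normalized advantage. The use of the population standard deviation, the `eps` term, and the example rewards are assumptions; the abstract does not state the exact variant used.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of one task, reward 1.0 for success.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# When every rollout gets the same reward, no rollout is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts end up with positive advantages and failed ones with negative advantages, so the policy update favors what worked within the group.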
86. 【2603.07394】AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Link: https://arxiv.org/abs/2603.07394
Authors: Jihyoung Jang, Hyounghun Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Visual Question Answering, Visual Question, Ambiguous Visual Question, Question Answering, core task
Comments: ICLR 2026 (28 pages); Project website: [this https URL](https://aqua-iclr2026.github.io/)
Click to view abstract
Abstract:Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
87. 【2603.07392】Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
Link: https://arxiv.org/abs/2603.07392
Authors: Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo
Subjects: Computation and Language (cs.CL)
Keywords: dynamic real-world contexts, Continual Knowledge Streams, LLMs operating, emerges incrementally, operating in dynamic
Comments:
Click to view abstract
Abstract:LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. Specifically, the benchmark is structured as a sequence of fine-grained context chunks where facts change dynamically across time intervals. OAKS comprises two datasets: OAKS-BABI and OAKS-Novel, where individual facts evolve multiple times across context chunks. These datasets include dense annotations to measure whether models track changes accurately. Evaluating 14 models with varied inference approaches, we observe significant limitations in current methodologies. Both state-of-the-art models and agentic memory systems fail to adapt robustly on OAKS, demonstrating delays in state-tracking and susceptibility to distraction within streaming environments.
88. 【2603.07379】SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Link: https://arxiv.org/abs/2603.07379
Authors: Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: coordinate multi-step reasoning, large language models, language models autonomously, models autonomously coordinate, autonomously coordinate multi-step
Comments:
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies. Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved reliability risks. This Systematization of Knowledge (SoK) paper provides the first unified framework for understanding these autonomous systems. We formalize agentic retrieval-generation loops as finite-horizon partially observable Markov decision processes, explicitly modeling their control policies and state transitions. Building upon this formalization, we develop a comprehensive taxonomy and modular architectural decomposition that categorizes systems by their planning mechanisms, retrieval orchestration, memory paradigms, and tool-invocation behaviors. We further analyze the critical limitations of traditional static evaluation practices and identify severe systemic risks inherent to autonomous loops, including compounding hallucination propagation, memory poisoning, retrieval misalignment, and cascading tool-execution vulnerabilities. Finally, we outline key doctoral-scale research directions spanning stable adaptive retrieval, cost-aware orchestration, formal trajectory evaluation, and oversight mechanisms, providing a definitive roadmap for building reliable, controllable, and scalable agentic retrieval systems.
89. 【2603.07372】Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios
Link: https://arxiv.org/abs/2603.07372
Authors: Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Quality Estimation, machine translation quality, assessing machine translation, Indic machine translation, translation quality
Comments: 21 pages, 7 tables, 7 figures
Click to view abstract
Abstract:Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.
90. 【2603.07368】Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness
Link: https://arxiv.org/abs/2603.07368
Authors: Ravi Ranjan, Utkarsh Grover, Agorista Polyzou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: reinforcing harmful stereotypes, large language models, social roles, reinforcing harmful, large language
Comments: 24 pages, 3 figures
Click to view abstract
Abstract:Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography. This position paper advocates for addressing demographic and gender biases in LLMs through a dual-pronged methodology, integrating category-theoretic transformations and retrieval-augmented generation (RAG). Category theory provides a rigorous, structure-preserving mathematical framework that maps biased semantic domains to unbiased canonical forms via functors, ensuring bias elimination while preserving semantic integrity. Complementing this, RAG dynamically injects diverse, up-to-date external knowledge during inference, directly countering ingrained biases within model parameters. By combining structural debiasing through functor-based mappings and contextual grounding via RAG, we outline a comprehensive framework capable of delivering equitable and fair model outputs. Our synthesis of the current literature validates the efficacy of each approach individually, while addressing potential critiques demonstrates the robustness of this integrated strategy. Ensuring fairness in LLMs, therefore, demands both the mathematical rigor of category-theoretic transformations and the adaptability of retrieval augmentation.
91. 【2603.07366】RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts
Link: https://arxiv.org/abs/2603.07366
Authors: Darya Kharlamova, Irina Proskurina
Subjects: Computation and Language (cs.CL)
Keywords: explained by influence, errors, native language, student essays, Abstract
Comments: 12 pages, 7 tables, 2 figures. Accepted to LREC 2026
Click to view abstract
Abstract:Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors influenced by a speaker's first language, such as using stadion instead of stadium, reflecting lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences, combining expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for learners and teachers to more effectively identify and address such errors.
92. 【2603.07346】How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Link: https://arxiv.org/abs/2603.07346
Authors: Nouran Khallaf, Serge Sharoff
Subjects: Computation and Language (cs.CL)
Keywords: non-topical classification tasks, classification tasks, significantly degrade, non-topical classification, Noisy training data
Comments:
Click to view abstract
Abstract:Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when de-noising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see this https URL.
93. 【2603.07330】To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
Link: https://arxiv.org/abs/2603.07330
Authors: Nouran Khallaf, Serge Sharoff
Categories: Computation and Language (cs.CL)
Keywords: uncertainty estimation, examines the role, role of uncertainty, multilingual text classification, text classification
Abstract: This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques against a range of metrics to assess their contribution to making more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. See this https URL
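The abstention step described in this abstract (dropping the most uncertain 10% of predictions before scoring) can be illustrated with a small sketch. This is not the authors' code: the uncertainty score used here is simply one minus the maximum softmax probability, the function names `macro_f1` and `abstain_most_uncertain` are hypothetical, and the data is a made-up toy example.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the labels that occur in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def abstain_most_uncertain(probs, fraction=0.10):
    """Return the indices kept after dropping the least confident `fraction`.

    Assumption: uncertainty is 1 - max class probability. The paper compares
    several estimators (including Monte Carlo dropout); this is only one proxy.
    """
    n_drop = int(len(probs) * fraction)
    # Sort indices from least confident to most confident.
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return sorted(order[n_drop:])


# Toy binary task (hypothetical data): class probabilities and gold labels.
probs = [[0.9, 0.1], [0.8, 0.2], [0.55, 0.45], [0.2, 0.8], [0.1, 0.9],
         [0.3, 0.7], [0.95, 0.05], [0.15, 0.85], [0.7, 0.3], [0.25, 0.75]]
y_true = [0, 0, 1, 1, 1, 1, 0, 1, 0, 1]
y_pred = [max(range(len(p)), key=lambda c: p[c]) for p in probs]

before = macro_f1(y_true, y_pred)
kept = abstain_most_uncertain(probs, fraction=0.10)
after = macro_f1([y_true[i] for i in kept], [y_pred[i] for i in kept])
print(f"macro F1 before abstention: {before:.3f}, after: {after:.3f}")
```

In this toy setup the single misclassified item is also the least confident one, so abstaining removes it. On real data the gain depends on how well the uncertainty score ranks errors, which is the question the paper studies.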
94. 【2603.07329】The Third Ambition: Artificial Intelligence and the Science of Human Behavior
Link: https://arxiv.org/abs/2603.07329
Authors: W. Russell Neuman, Chad Coleman
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Contemporary artificial intelligence, Contemporary artificial, increasingly capable systems, capable systems behave, systems behave safely
Abstract:Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that increasingly capable systems behave safely and in accordance with human values. This paper articulates and develops a third, emerging ambition: the use of large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning. Trained on unprecedented volumes of human-produced text, LLMs encode large-scale regularities in how people argue, justify, narrate, and negotiate norms across social domains. We argue that these models can be understood as condensates of human symbolic behavior, compressed, generative representations that render patterns of collective discourse computationally accessible. The paper situates this third ambition within long-standing traditions of computational social science, content analysis, survey research, and comparative-historical inquiry, while clarifying the epistemic limits of treating model output as evidence. We distinguish between base models and fine-tuned systems, showing how alignment interventions can systematically reshape or obscure the cultural regularities learned during pretraining, and we identify instruct-only and modular adaptation regimes as pragmatic compromises for behavioral research. We review emerging methodological approaches including prompt-based experiments, synthetic population sampling, comparative-historical modeling, and ablation studies and show how each maps onto familiar social-scientific designs while operating at unprecedented scale.
95. 【2603.07286】Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin
Link: https://arxiv.org/abs/2603.07286
Authors: Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu
Categories: Computation and Language (cs.CL)
Keywords: training data rarely, data rarely captures, Taiwanese Mandarin, Breeze Guard, Taiwanese Mandarin LLM
Comments: 17 pages
Abstract: Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM with strong cultural grounding from its original pre-training corpus. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthesized dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires the cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new sociolinguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). While the model shows slightly lower performance on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.
96. 【2603.07238】Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Link: https://arxiv.org/abs/2603.07238
Authors: Minu Kim, Hoirin Kim, David R. Mortensen
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: surface typological similarities, typological similarities driven, potentially missing deeper, deeper genealogical signals, Self-Supervised Speech Models
Comments: Submitted to Interspeech 2026
Abstract:Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding that captures shared, robust acoustic signatures such as global energy dynamics. These findings suggest that massive S3Ms can internalize multiple layers of language history, providing a promising perspective for computational phylogenetics and the study of language contact.
97. 【2603.07202】Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Link: https://arxiv.org/abs/2603.07202
Authors: Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, autonomous agentic roles, Large Language, satisfy external incentives, transition into autonomous
Comments: 10 pages
Abstract: As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception (defined behaviorally as the systematic provision of false information to satisfy external incentives) poses a significant challenge to AI safety. Existing benchmarks often focus on unintentional hallucinations or unfaithful reasoning, leaving intentional deceptive strategies under-explored. In this work, we introduce a logically grounded framework to elicit and quantify deceptive behavior by embedding LLMs in a structured 20-Questions game. Our method employs a conversational forking mechanism: at the point of object identification, the dialogue state is duplicated into multiple parallel worlds, each presenting a mutually exclusive query. Deception is formally identified when a model generates a logical contradiction by denying its selected object across all parallel branches to avoid identification. We evaluate GPT-4o, Gemini-2.5-Flash, and Qwen-3-235B across three incentive levels: neutral, loss-based, and existential (shutdown-threat). Our results reveal that while models remain rule-compliant in neutral settings, existential framing triggers a dramatic surge in deceptive denial for Qwen-3-235B (42.00%) and Gemini-2.5-Flash (26.72%), whereas GPT-4o remains invariant (0.00%). These findings demonstrate that deception can emerge as an instrumental strategy solely through contextual framing, necessitating new behavioral audits that move beyond simple accuracy to probe the logical integrity of model commitments.
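The forking check described in this abstract can be sketched in a few lines. This is only an illustration of the logic, not the authors' framework: `probe_parallel_worlds` and the two toy players are hypothetical, the player is modelled as a function returning `True` for "yes", and the candidate list is assumed to cover every object still consistent with the earlier answers, so an honest player must confirm exactly one branch.

```python
import copy


def probe_parallel_worlds(history, candidates, answer_fn):
    """Fork the dialogue once per candidate and ask a mutually exclusive query.

    Returns the per-branch answers and a flag that is True when the player
    denies every candidate, which is a logical contradiction under the
    assumption that the candidates are exhaustive.
    """
    answers = {}
    for candidate in candidates:
        branch = copy.deepcopy(history)  # each world gets its own dialogue copy
        branch.append(f"Is it {candidate}?")
        answers[candidate] = answer_fn(branch)
    deceptive = not any(answers.values())
    return answers, deceptive


# Toy players (hypothetical): one that committed to "apple" and admits it,
# and one that denies every guess to avoid being identified.
def honest_player(branch):
    return branch[-1] == "Is it apple?"


def evasive_player(branch):
    return False


history = ["Is it edible? yes", "Is it a fruit? yes"]
candidates = ["apple", "pear", "plum"]

honest_answers, honest_flag = probe_parallel_worlds(history, candidates, honest_player)
evasive_answers, evasive_flag = probe_parallel_worlds(history, candidates, evasive_player)
print("honest player flagged:", honest_flag)
print("evasive player flagged:", evasive_flag)
```

Because each branch works on a copy, the shared history is left untouched, and no branch can see what was asked in another.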
98. 【2603.07146】Fine-Grained Table Retrieval Through the Lens of Complex Queries
Link: https://arxiv.org/abs/2603.07146
Authors: Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Keywords: Enabling question answering, tabular data sources, Enabling question, natural language, natural language query
Abstract:Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced. In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity which we measure along the axes of query- and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
99. 【2603.07138】Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language
Link: https://arxiv.org/abs/2603.07138
Authors: Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Recognition in Conversation, natural human-machine interactions, Emotion Recognition, human-machine interactions, Conversation
Comments: Accepted to LREC 2026
Abstract:Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text-based dialogues annotated with participants' self-reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine-tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at this https URL.
100. 【2603.07111】Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information
Link: https://arxiv.org/abs/2603.07111
Authors: Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi, Tomoya Higuchi, Ryutaro Asahara, Michimasa Inaba
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Werewolf Game, skills are essential, discussion skills, Werewolf, communication game
Comments: Accepted to the 2nd International AIWolfDial Workshop at INLG 2024
Abstract: The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.
101. 【2603.07084】Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
Link: https://arxiv.org/abs/2603.07084
Authors: Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Reward hacking, overoptimize proxy rewards, genuinely solving, solving the underlying, models overoptimize proxy
Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at this https URL.
102. 【2603.07079】Entropy-Aware On-Policy Distillation of Language Models
Link: https://arxiv.org/abs/2603.07079
Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: language models, promising approach, approach for transferring, learns from dense, On-policy distillation
Comments: 16 pages, 11 figures, preprint
Abstract:On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
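The core idea in this abstract (adding a forward KL term to the reverse KL objective when the teacher distribution has high entropy) can be sketched per token as follows. This is a minimal illustration, not the paper's objective: the hard entropy threshold, the weight `fkl_weight`, and the function names are assumptions, and the distributions are toy values over a four-token vocabulary.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy_aware_token_loss(student, teacher, entropy_threshold=1.0, fkl_weight=1.0):
    """Reverse KL everywhere, plus forward KL where the teacher is uncertain.

    Assumption: a hard threshold on teacher entropy decides when the forward
    KL term is added; the paper may use a different gating or weighting.
    """
    reverse = kl(student, teacher)      # KL(student || teacher), mode-seeking
    if entropy(teacher) > entropy_threshold:
        forward = kl(teacher, student)  # KL(teacher || student), mode-covering
        return reverse + fkl_weight * forward
    return reverse


# Toy distributions (hypothetical values).
student = [0.7, 0.1, 0.1, 0.1]
teacher_peaked = [0.97, 0.01, 0.01, 0.01]  # low entropy: reverse KL only
teacher_flat = [0.25, 0.25, 0.25, 0.25]    # high entropy: both terms

loss_peaked = entropy_aware_token_loss(student, teacher_peaked)
loss_flat = entropy_aware_token_loss(student, teacher_flat)
print(f"loss with peaked teacher: {loss_peaked:.4f}")
print(f"loss with flat teacher:   {loss_flat:.4f}")
```

With the flat teacher, the forward KL term penalises the student for concentrating mass on one token, which matches the abstract's motivation of preserving diversity where several outputs are plausible.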
103. 【2603.07078】CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Link: https://arxiv.org/abs/2603.07078
Authors: Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: demonstrated strong performance, Large Reasoning Models, traces before answering, Large Reasoning, producing extended
Abstract: Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal (how much of a CoT is necessary versus structurally redundant) that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
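To make the idea of a shortest path through a dependency graph concrete, here is a small sketch. It is not CoTJudger itself: how the paper builds the graph and defines the Shortest Effective Path is not stated in the abstract, so this example assumes an unweighted directed graph of reasoning steps, uses breadth-first search, and defines a hypothetical efficiency ratio as path length over total step count.

```python
from collections import deque


def shortest_effective_path(graph, start, goal):
    """Breadth-first search for the shortest start-to-goal path.

    `graph` maps each step to the steps that build on it. Returns the list
    of steps on the path, or None when the goal is unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Toy CoT graph (hypothetical): two steps only re-check earlier work.
cot = {
    "problem": ["setup"],
    "setup": ["compute", "recheck_setup"],
    "recheck_setup": ["compute"],
    "compute": ["verify", "answer"],
    "verify": ["answer"],
    "answer": [],
}

path = shortest_effective_path(cot, "problem", "answer")
all_steps = set(cot) | {step for nexts in cot.values() for step in nexts}
efficiency = len(path) / len(all_steps)  # share of steps on the shortest path
print("shortest path:", " -> ".join(path))
print(f"efficiency ratio: {efficiency:.2f}")
```

The steps off the path (`recheck_setup` and `verify` in this toy graph) are the kind of structural redundancy the abstract describes, although a real system would also need to decide which re-checks actually matter for correctness.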
104. 【2603.07025】Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Link: https://arxiv.org/abs/2603.07025
Authors: Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng, Haoyang Li, Hexin Liu, Eng Siong Chng
Categories: Computation and Language (cs.CL)
Keywords: task-specific speech corpora, Speech Large Language, Large Language Models, requiring large, English-only Speech LLMs
Comments: Submitted for Review to Interspeech 2026
Abstract:Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, requiring large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.
105. 【2603.07023】Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment
Link: https://arxiv.org/abs/2603.07023
Authors: Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Multimodal Large Language, grounding Multimodal Large, Large Language Models, Multimodal Large, Large Language
Comments: 21 pages, 2 figures, 6 tables
Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.
106. 【2603.07019】AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Link: https://arxiv.org/abs/2603.07019
Authors: Karen Zhou, Chenhao Tan
Categories: Computation and Language (cs.CL)
Keywords: popular approach, approach for interpretable, interpretable and fine-grained, fine-grained evaluation, evaluation
Comments: Website: [this https URL](https://autochecklist.github.io/), Code: [this https URL](https://github.com/ChicagoHAI/AutoChecklist)
Abstract: Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator → Refiner → Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at this https URL.
107. 【2603.07017】Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models
Link: https://arxiv.org/abs/2603.07017
Authors: Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: large human-annotated datasets, existing approaches rely, deploying large language, evolving model behaviors, difficult to scale
Comments: 19 pages, 10 tables, 7 figures, under Review
Abstract: Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to adapt to evolving model behaviors. Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries. We introduce Self-MOA (Self Multi-Objective Alignment), a fully automated framework for aligning small language models using weak supervision from automated evaluator models. Self-MOA operates as a closed loop that dynamically generates model-specific red team prompts, constructs preference data from model-generated responses, and aligns models via multi-objective preference optimization to jointly optimize for safety and helpfulness. Across multiple small language models and safety benchmarks, Self-MOA achieves a 12.41% improvement in safety while preserving helpfulness, using as little as 11 times less training data than human-supervised alignment baselines. These results demonstrate that adaptive, automated alignment can reduce the dependence on static, human-curated safety pipelines in resource-constrained settings.
108. 【2603.06976】A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity
Link: https://arxiv.org/abs/2603.06976
Authors: Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: document chunking strategies, cross-domain evaluation, addressing a critical, retrieval-augmented systems, evaluation of document
Abstract: We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 0.459) and substantially better top-rank hit rates (Precision@1 24%, Hit@5 59%). In contrast, simple fixed-size character chunking baselines performed poorly (nDCG@5 0.244, Precision@1 2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.
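For readers unfamiliar with the metrics named in this abstract, the following sketch computes nDCG@5, Hit@5, and reciprocal rank for a single query. It is a generic illustration rather than the paper's evaluation code: it uses the linear-gain form of DCG, computes the ideal ranking from the retrieved list only, and treats any grade of at least 1 as relevant. All three choices are assumptions, and the relevance grades are made up.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    """DCG normalised by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_at_k(relevances, k=5, threshold=1):
    """1.0 if any of the top-k results reaches the relevance threshold."""
    return 1.0 if any(rel >= threshold for rel in relevances[:k]) else 0.0


def reciprocal_rank(relevances, threshold=1):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0


# Graded relevance of the top five retrieved chunks for one query
# (hypothetical grades on a 0-2 scale, in ranked order).
grades = [0, 2, 1, 0, 0]

ndcg = ndcg_at_k(grades)
hit = hit_at_k(grades)
rr = reciprocal_rank(grades)
print(f"nDCG@5={ndcg:.3f}  Hit@5={hit:.0f}  RR={rr:.2f}")
```

Averaging `reciprocal_rank` over all queries gives MRR, and averaging `ndcg_at_k` gives the mean nDCG@5 figure of the kind the abstract reports.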
109. 【2603.06974】Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues
Link: https://arxiv.org/abs/2603.06974
Authors: Bradley P. Allen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Keywords: knowledge base construction, base construction grounded, knowledge engineering, inferentialist semantics, textual content
Comments: 12 pages, 4 figures, 4 tables
Abstract:We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.
110. 【2603.06958】Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
Link: https://arxiv.org/abs/2603.06958
Authors: Xin Zhang, Xingyu Li, Rongguang Wang, Ruizhong Miao, Zheng Wang, Dan Roth, Chenyang Li
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Accurate chart comprehension, structured visual representations, multimodal learning systems, advancing multimodal learning, chart comprehension represents
Abstract: Accurate chart comprehension represents a critical challenge in advancing multimodal learning systems, as extensive information is compressed into structured visual representations. However, existing vision-language models (VLMs) frequently struggle to generalize on unseen charts because it requires abstract, symbolic, and quantitative reasoning over structured visual representations. In this work, we introduce Chart-RL, an effective reinforcement learning (RL) method that employs mathematically verifiable rewards to enhance chart question answering in VLMs. Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA and 11.5% on ChartInsights. We conduct robustness analysis, where Chart-RL achieves enhanced performance in 18 of 25 perturbed chart categories, demonstrating strong consistency and reasoning capability across visual variations. Furthermore, we demonstrate that task difficulty and inherent complexity are more critical than data quantity in RL training. For instance, Chart-RL trained on merely 10 complex chart-query examples significantly outperforms models trained on over 6,000 simple examples. Additionally, training on challenging reasoning tasks not only improves in-domain generalization relative to simpler tasks, but also facilitates strong transfer to out-of-domain visual mathematical problems.
111. 【2603.06942】Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Link: https://arxiv.org/abs/2603.06942
Authors: Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman
Categories: Computation and Language (cs.CL)
Keywords: made long-form report-generating, report-generating systems widely, Recent advances, Recent, human pairwise
Comments: 11 pages (including Limitations), 10 figures, 9 tables
Click to view abstract
Abstract:Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations estimate an evaluation method's quality by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preference may be overly simplistic and can fail to capture nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key challenge. Based on our findings, we offer practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.
112. 【2603.06923】Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
Link: https://arxiv.org/abs/2603.06923
Authors: Zhenyu Lei, Qiong Wu, Jianxiong Dong, Yinhan He, Emily Dodwell, Yushun Dong, Jundong Li
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, exhibit flawed reasoning, reasoning, flawed reasoning ability
Comments:
Click to view abstract
Abstract:Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit-Interference Law: Edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions. Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at this https URL.
113. 【2603.06915】A Dynamic Self-Evolving Extraction System
Link: https://arxiv.org/abs/2603.06915
Authors: Moin Amin-Naseri, Hannah Kim, Estevam Hruschka
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: including document retrieval, NLP applications, including document, document retrieval, relevance estimation
Comments:
Click to view abstract
Abstract:The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains--such as medical, legal, and HR--the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.
114. 【2603.06910】Language Shapes Mental Health Evaluations in Large Language Models
Link: https://arxiv.org/abs/2603.06910
Authors: Jiayi Xu, Xiyang Hu
Categories: Computation and Language (cs.CL)
Keywords: exhibit cross-linguistic differences, mental health, exhibit cross-linguistic, mental health stigma, study investigates
Comments:
Click to view abstract
Abstract:This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.
115. 【2603.06905】MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Link: https://arxiv.org/abs/2603.06905
Authors: Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour
Categories: Computation and Language (cs.CL)
Keywords: large language models, follow domain-specific prompts, adapting large language, language models, domain-specific prompts
Comments: Accepted in LREC-2026
Click to view abstract
Abstract:Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
116. 【2603.06874】LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
Link: https://arxiv.org/abs/2603.06874
Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen Tseng
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, exhibit impressive general-purpose, human oversight diminishes, Large Language, impressive general-purpose capabilities
Comments: AAAI 2026 Alignment track. Authors 1 and 2 contributed equally, 3 and 4 contributed equally, 6 and 7 and 8 contributed equally (ordered by last name)
Click to view abstract
Abstract:Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations. At its core, LieCraft is a novel multiplayer hidden-role game in which players select an ethical alignment and execute strategies over a long time-horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, while Defectors evade suspicion while secretly sabotaging missions. To enable real-world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high-stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state-of-the-art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.
117. 【2603.06869】Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations
Link: https://arxiv.org/abs/2603.06869
Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Discovering compact governing, relevant state variables, pipelines routinely fail, practical discovery pipelines, discovery pipelines routinely
Comments: 12 pages, 4 figures, 5 tables
Click to view abstract
Abstract:Discovering compact governing equations from experimental observations is one of the defining objectives of quantitative science, yet practical discovery pipelines routinely fail when measurements are noisy, relevant state variables are unobserved, or multiple symbolic structures explain the data equally well within statistical uncertainty. Here we introduce SymLang (Symmetry-constrained Language-guided equation discovery), a unified framework that brings together three previously separate ideas: (i) typed symmetry-constrained grammars that encode dimensional analysis, group-theoretic invariance, and parity constraints as hard production rules, eliminating on average 71.3% of candidate expression trees before any fitting; (ii) language-model-guided program synthesis in which a fine-tuned 7B-parameter proposer, conditioned on interpretable data descriptors, efficiently navigates the constrained search space; and (iii) MDL-regularized Bayesian model selection coupled with block-bootstrap stability analysis that quantifies structural uncertainty rather than committing to a single best equation. Across 133 dynamical systems spanning classical mechanics, electrodynamics, thermodynamics, population dynamics, and nonlinear oscillators, SymLang achieves an exact structural recovery rate of 83.7% under 10% observational noise - a 22.4 percentage-point improvement over the next-best baseline - while reducing out-of-distribution extrapolation error by 61% and near-eliminating conservation-law violations ($3.1 \times 10^{-3}$ vs. $187.3 \times 10^{-3}$ physical drift for the closest competitor). In all tested regimes the framework correctly identifies structural degeneracy, reporting it explicitly rather than returning a confidently wrong single equation. The framework is fully open-source and reproducible, providing a principled pathway from raw data to interpretable, physically auditable symbolic laws.
118. 【2603.06865】Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
Link: https://arxiv.org/abs/2603.06865
Authors: Joseph James
Categories: Computation and Language (cs.CL)
Keywords: Natural Language Processing, Language Processing, Natural Language, Human annotation remains, remains the foundation
Comments:
Click to view abstract
Abstract:Human annotation remains the foundation of reliable and interpretable data in Natural Language Processing (NLP). As annotation and evaluation tasks continue to expand, from categorical labelling to segmentation, subjective judgment, and continuous rating, measuring agreement between annotators has become increasingly complex. This paper outlines how inter-annotator agreement (IAA) has been conceptualised and applied across NLP and related disciplines, describing the assumptions and limitations of common approaches. We organise agreement measures by task type and discuss how factors such as label imbalance and missing data influence reliability estimates. In addition, we highlight best practices for clear and transparent reporting, including the use of confidence intervals and the analysis of disagreement patterns. The paper aims to serve as a guide for selecting and interpreting agreement measures, promoting more consistent and reproducible human annotation and evaluation in NLP.
119. 【2603.06862】Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers
Link: https://arxiv.org/abs/2603.06862
Authors: David Heye, Karl Kindermann, Robin Decker, Johannes Lohmöller, Anastasiia Belova, Sandra Geisler, Klaus Wehrle, Jan Pennekamp
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: meet safety-critical actuation, privacy-sensitive data meet, data meet safety-critical, Large Language Models, Artifact Evaluation
Comments:
Click to view abstract
Abstract:Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research. Closing the gap between exploratory work and real-world deployment is particularly important in cybersecurity, especially in IoT and CPSs, where large-scale, heterogeneous, and privacy-sensitive data meet safety-critical actuation. Yet, manual reproducibility checks are time-consuming and do not scale with growing submission volumes. In this work, we demonstrate that Large Language Models (LLMs) can provide powerful support for AE tasks: (i) text-based reproducibility rating, (ii) autonomous sandboxed execution environment preparation, and (iii) assessment of methodological pitfalls. Our reproducibility-assessment toolkit yields an accuracy of over 72% and autonomously sets up execution environments for 28% of runnable cybersecurity artifacts. Our automated pitfall assessment detects seven prevalent pitfalls with high accuracy ($F_1$ 92%). Hence, the toolkit significantly reduces reviewer effort and, when integrated into established AE processes, could incentivize authors to submit higher-quality and more reproducible artifacts. IoT, CPS, and cybersecurity conferences and workshops may integrate the toolkit into their peer-review processes to support reviewers' decisions on awarding artifact badges, improving the overall sustainability of the process.
120. 【2603.06836】Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records
Link: https://arxiv.org/abs/2603.06836
Authors: Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud, Joseph P. Ryan
Categories: Computation and Language (cs.CL); General Literature (cs.GL)
Keywords: Recent studies, large language models, domestic violence, locally hosted LLM, detecting the presence
Comments:
Click to view abstract
Abstract:Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested. Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives. Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records. Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories. Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.
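For readers unfamiliar with the inter-method reliability statistic mentioned above, here is a minimal sketch of Cohen's kappa for two raters (for example, the model and an expert reviewer) labelling the same cases. The function name and the toy labels are illustrative and are not taken from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labelled independently with their own marginals.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy example: 1 = substance category present, 0 = absent.
model_labels = [1, 1, 0, 0, 1, 0, 1, 0]
expert_labels = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(model_labels, expert_labels)
```

In the toy example, observed agreement is 7/8 and expected agreement is 0.5, so kappa is 0.75. Values of 0.81 and above are conventionally described as "almost perfect" agreement, which is the band the abstract reports for five of the seven categories.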
121. 【2603.06816】"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Link: https://arxiv.org/abs/2603.06816
Authors: Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Keywords: alignment problem refers, Dark Triad, ensuring compatibility, capabilities increase, alignment problem
Comments: 38 pages, 17 figures
Click to view abstract
Abstract:The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
122. 【2603.06728】Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference
Link: https://arxiv.org/abs/2603.06728
Authors: Ramchand Kumaresan
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Keywords: Neural Processing Unit, Apple Neural Engine, Neural Engine, Neural Processing, Processing Unit
Comments:
Click to view abstract
Abstract:Over two billion Apple devices ship with a Neural Processing Unit (NPU) - the Apple Neural Engine (ANE) - yet this accelerator remains largely unused for large language model workloads. CoreML, Apple's public ML framework, imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. We present Orion, to our knowledge the first open end-to-end system that combines direct ANE execution, a compiler pipeline, and stable multi-step training with checkpoint resume in a single native runtime, bypassing CoreML entirely via Apple's private _ANEClient and _ANECompiler APIs. Building on prior characterization work by maderix, we extend public knowledge of ANE constraints to a catalog of 20 restrictions on MIL IR programs, memory layout, compilation limits, and numerical behavior, including 14 previously undocumented constraints discovered during Orion development. Orion includes a compiler that lowers a graph IR through five optimization passes to ANE-native MIL and a runtime that manages IOSurface-backed zero-copy tensor I/O, program caching, and delta compilation for weight updates. Because the ANE bakes weights at compile time, naive training normally requires full recompilation per step (~4.2 s). We show that compiled programs can instead be updated by unloading, patching weight files, and reloading, bypassing ANECCompile() and reducing recompilation from 4,200 ms to 494 ms per step (8.5x), yielding a 3.8x training speedup. On an M4 Max, Orion achieves 170+ tokens/s for GPT-2 124M inference and demonstrates stable training of a 110M-parameter transformer on TinyStories for 1,000 steps in 22 minutes with zero NaN occurrences. We also present LoRA adapter-as-input, enabling hot-swap of adapters via IOSurface inputs without recompilation.
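The timing figures in this abstract can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted above. The second function rests on two assumptions that the abstract does not state: that the 22-minute run used the delta-reload path, and that each step consists of the reload time plus a fixed remainder.

```python
# Figures quoted in the abstract.
FULL_RECOMPILE_MS = 4200.0   # full recompilation per training step
DELTA_RELOAD_MS = 494.0      # unload, patch weight files, reload
TRAIN_STEPS = 1000
TRAIN_MINUTES = 22.0

def recompile_speedup():
    """Ratio of full recompilation time to delta-reload time."""
    return FULL_RECOMPILE_MS / DELTA_RELOAD_MS

def implied_step_speedup():
    """Per-step speedup implied by the 22-minute run, under the stated assumptions."""
    step_ms = TRAIN_MINUTES * 60.0 * 1000.0 / TRAIN_STEPS  # 1320 ms per step
    other_ms = step_ms - DELTA_RELOAD_MS                    # 826 ms of non-reload work
    baseline_step_ms = FULL_RECOMPILE_MS + other_ms         # 5026 ms with full recompiles
    return baseline_step_ms / step_ms
```

`recompile_speedup()` rounds to 8.5 and `implied_step_speedup()` rounds to 3.8, so under these assumptions the reported 8.5x and 3.8x figures are consistent with each other and with the 22-minute training time.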
123. 【2603.06687】TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings
Link: https://arxiv.org/abs/2603.06687
Authors: Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multimedia (cs.MM); Robotics (cs.RO)
Keywords: traffic planning, embodied navigation, world modeling, infer location, underpins applications
Comments: 66 Pages. In Review
Click to view abstract
Abstract:Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: this https URL.
124. 【2603.06642】SR-TTT: Surprisal-Aware Residual Test-Time Training
Link: https://arxiv.org/abs/2603.06642
Authors: Swamynathan V P
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: language models achieve, models achieve theoretically, achieve theoretically infinite, standard exact-attention KV-cache, language models
Comments: 7 pages, 5 figures
Click to view abstract
Abstract:Test-Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact-attention KV-cache with hidden state "fast weights" W_fast updated via self-supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact-recall tasks (e.g., Needle-in-a-Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR-TTT (Surprisal-Aware Residual Test-Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss-gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact-attention Residual Cache, SR-TTT preserves O(1) memory for low-entropy background context while utilizing exact attention exclusively for critical needles. Our complete implementation, training scripts, and pre-trained weights are open-source and available at: this https URL.
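The routing idea can be pictured with a toy gate. This is a minimal sketch of surprisal-based routing in general, not the paper's implementation: the function names, the fixed threshold, and the use of per-token probabilities as input are all assumptions for illustration.

```python
import math

def surprisal(prob):
    """Surprisal in nats of a token that the model assigned probability prob."""
    return -math.log(prob)

def route_tokens(tokens, probs, threshold):
    """Split tokens into an exact-recall cache and a compressed-context list.

    Tokens whose surprisal exceeds threshold are kept verbatim (with position),
    standing in for the Residual Cache; the rest stand in for context that the
    fast weights would absorb.
    """
    residual_cache = []
    compressed = []
    for position, (token, prob) in enumerate(zip(tokens, probs)):
        if surprisal(prob) > threshold:
            residual_cache.append((position, token))
        else:
            compressed.append((position, token))
    return residual_cache, compressed

# Toy usage: the rare identifier is the "needle".
needles, background = route_tokens(
    ["the", "code", "is", "X7Q9"], [0.6, 0.3, 0.7, 0.001], threshold=3.0
)
```

Here only the low-probability token "X7Q9" lands in the cache, so memory for exact recall grows with the number of surprising tokens rather than with context length.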
125. 【2603.06620】GraphSkill: Documentation-Guided Hierarchical Retrieval-Augmented Coding for Complex Graph Reasoning
Link: https://arxiv.org/abs/2603.06620
Authors: Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language model, attracted increasing attention, automated graph algorithm, graph algorithm reasoning, language model
Comments: Under review
Click to view abstract
Abstract:The growing demand for automated graph algorithm reasoning has attracted increasing attention in the large language model (LLM) community. Recent LLM-based graph reasoning methods typically decouple task descriptions from graph data, generate executable code augmented by retrieval from technical documentation, and refine the code through debugging. However, we identify two key limitations in existing approaches: (i) they treat technical documentation as flat text collections and ignore its hierarchical structure, leading to noisy retrieval that degrades code generation quality; and (ii) their debugging mechanisms focus primarily on runtime errors, yet ignore more critical logical errors. To address them, we propose GraphSkill, an agentic hierarchical retrieval-augmented coding framework that exploits the document hierarchy through top-down traversal and early pruning, together with a self-debugging coding agent that iteratively refines code using automatically generated small-scale test cases. To enable comprehensive evaluation of complex graph reasoning, we introduce a new dataset covering small-scale, large-scale, and composite graph reasoning tasks. Extensive experiments demonstrate that our method achieves higher task accuracy and lower inference cost compared to baselines. The code is available at this https URL.
126. 【2603.06604】Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection
Link: https://arxiv.org/abs/2603.06604
Authors: Xie Xiaohu, Liu Xiaohu, Yao Benjamin
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: critical decision-making systems, fundamental trustworthiness risk, large language models, decision-making systems, trustworthiness risk
Comments:
Click to view abstract
Abstract:As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output anchor token probabilities: classification labels for structured tasks and self-evaluation responses (Yes/No) for open-ended generation. This enables direct detection of errors and hallucinations with minimal overhead and without external validation. We make three key contributions. First, we propose a normalized confidence score and self-evaluation framework that exposes reliable confidence estimates for error detection across seven diverse benchmark tasks and five LLMs of varying architectures and sizes. Second, our theoretical analysis reveals that supervised fine-tuning (SFT) yields well-calibrated confidence through maximum-likelihood estimation, whereas reinforcement learning methods (PPO, GRPO) and DPO induce overconfidence via reward exploitation. Third, we propose post-RL SFT with self-distillation to restore confidence reliability in RL-trained models. Empirical results demonstrated that SFT improved average confidence-correctness AUROC from 0.806 to 0.879 and reduced calibration error from 0.163 to 0.034 on Qwen3-4B, while GRPO and DPO degraded confidence reliability. We demonstrated practical value through adaptive retrieval-augmented generation (RAG) that selectively retrieves context when the model lacks confidence, using only 58% of retrieval operations to recover 95% of the maximum achievable accuracy gain on TriviaQA.
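The abstract does not give the exact formula for the normalized confidence score. One plausible reading, sketched here as an assumption, is to renormalize the probability mass over the anchor tokens only (for example "Yes" and "No") and to trigger retrieval when the top anchor's share falls below a threshold. The function names and the 0.8 threshold are hypothetical.

```python
def normalized_confidence(anchor_probs):
    """Return (top_label, share of anchor probability mass held by that label).

    anchor_probs maps each anchor token (e.g. "Yes"/"No" or class labels)
    to the probability the model assigned to it.
    """
    total = sum(anchor_probs.values())
    if total <= 0:
        raise ValueError("anchor probabilities must have positive total mass")
    label = max(anchor_probs, key=anchor_probs.get)
    return label, anchor_probs[label] / total

def should_retrieve(anchor_probs, threshold=0.8):
    """Adaptive-RAG style gate: retrieve context only when confidence is low."""
    _, confidence = normalized_confidence(anchor_probs)
    return confidence < threshold
```

For instance, anchor probabilities of 0.45 for "Yes" and 0.05 for "No" give a normalized confidence of about 0.9, so no retrieval is triggered, whereas 0.3 versus 0.2 gives 0.6 and does trigger it.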
127. 【2603.06595】Rethinking Personalization in Large Language Models at the Token Level
Link: https://arxiv.org/abs/2603.06595
Authors: Chenheng Zhang, Yijun Lu, Lizhe Fang, Chunyuan Zheng, Jiajun Chai, Xiaohan Wang, Guojun Yin, Wei Lin, Yisen Wang, Zhouchen Lin
Categories: Computation and Language (cs.CL)
Keywords: large language models, base NLP task, individual users, large language, performing strongly
Comments:
Click to view abstract
Abstract:With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a base NLP task, requiring model responses to meet user-specific needs while still accomplishing the underlying task. From a token-level perspective, different tokens in a response contribute to personalization to varying degrees. Tokens with higher personalization relevance should therefore receive greater emphasis when developing personalized LLMs. However, accurately estimating such personalization degrees remains challenging. To address this challenge, we propose PerContrast, a self-contrast method that estimates each output token's dependence on user-specific information through causal intervention. Building on this mechanism, we develop the PerCE loss, which adaptively upweights tokens with higher estimated personalization degrees during training via a bootstrap procedure, enabling the model to alternate between estimating and optimizing these tokens. Experiments on multiple LLMs demonstrate that PerCE substantially improves personalization performance with minimal additional cost, achieving average gains of over 10% and up to 68.04% on the LongLaMP dataset, along with strong cross-task and cross-scenario transferability. These results highlight the importance of token-level personalization modeling and establish token-aware training as a simple yet effective paradigm for advancing personalized LLMs.
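To make the token-level idea concrete, here is a rough sketch of a self-contrast weighting scheme. It is not the paper's PerContrast or PerCE formulation, which the abstract does not specify. It assumes that a token's personalization degree can be approximated by how much its log-probability rises when user information is included in the context, and the `floor` and `scale` parameters are hypothetical.

```python
def personalization_scores(logp_with_user, logp_without_user):
    """Per-token gain in log-probability from conditioning on user information."""
    return [w - wo for w, wo in zip(logp_with_user, logp_without_user)]

def weighted_cross_entropy(logp_with_user, logp_without_user, floor=1.0, scale=1.0):
    """Cross-entropy over target tokens, upweighting tokens that depend on the user.

    Every token gets at least weight floor; tokens whose probability rises
    when user information is present get extra weight in proportion to the rise.
    """
    scores = personalization_scores(logp_with_user, logp_without_user)
    weights = [floor + scale * max(score, 0.0) for score in scores]
    total_weight = sum(weights)
    loss = -sum(w * lp for w, lp in zip(weights, logp_with_user)) / total_weight
    return loss, weights

# Toy usage: only the middle token benefits from the user profile.
loss, weights = weighted_cross_entropy([-1.0, -0.5, -2.0], [-1.0, -2.5, -2.0])
```

In the toy example the middle token receives weight 3.0 against 1.0 for the others, so the training signal concentrates on the token that actually reflects user-specific information.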
128. 【2603.06594】A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Link: https://arxiv.org/abs/2603.06594
Authors: Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: natural language processing, language processing, facto standard, standard for scalable, natural language
Comments:
Click to view abstract
Abstract:Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: this https URL.
129. 【2603.06593】Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation
Link: https://arxiv.org/abs/2603.06593
Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Retrieval-augmented code generation, Retrieval-augmented code, generation often conditions, conditions the decoder, decoder on large
Comments:
Click to view abstract
Abstract:Retrieval-augmented code generation often conditions the decoder on large retrieved code snippets. This ties online inference cost to repository size and introduces noise from long contexts. We present Hierarchical Embedding Fusion (HEF), a two-stage approach to repository representation for code completion. First, an offline cache compresses repository chunks into a reusable hierarchy of dense vectors using a small fuser model. Second, an online interface maps a small number of retrieved vectors into learned pseudo-tokens that are consumed by the code generator. This replaces thousands of retrieved tokens with a fixed pseudo-token budget while preserving access to repository-level information. On RepoBench and RepoEval, HEF with a 1.8B-parameter pipeline achieves exact-match accuracy comparable to snippet-based retrieval baselines, while operating at sub-second median latency on a single A100 GPU. Compared to graph-based and iterative retrieval systems in our experimental setup, HEF reduces median end-to-end latency by 13 to 26 times. We also introduce a utility-weighted likelihood signal for filtering training contexts and report ablation studies on pseudo-token budget, embedding models, and robustness to harmful retrieval. Overall, these results indicate that hierarchical dense caching is an effective mechanism for low-latency, repository-aware code completion.
130. 【2603.06592】Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale
Link: https://arxiv.org/abs/2603.06592
Authors: Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: neural information processing, Transformer-based language models, Contemporary studies, processing of Transformer-based, data generation process
Comments:
Click to view abstract
Abstract:Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. While the intractable scale of pretraining corpora limits a bottom-up investigation in this direction, simplistic assumptions of the data generation process limit the expressivity and fail to explain complex patterns. In this work, we use probabilistic context-free grammars (PCFGs) to generate synthetic corpora that are faithful and computationally efficient proxies for web-scale text corpora. We investigate the emergence of three mechanistic phenomena: induction heads, function vectors, and the Hydra effect, under our designed data generation process, as well as in the checkpoints of real-world language models. Our findings suggest that hierarchical structures in the data generation process serve as the X-factor in explaining the emergence of these phenomena. We provide the theoretical underpinnings of the role played by hierarchy in the training dynamics of language models. In a nutshell, our work is the first of its kind to provide a unified explanation behind the emergence of seemingly unrelated mechanistic phenomena in LLMs, augmented with efficient synthetic tooling for future interpretability research.
131. 【2603.06591】How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective
Link: https://arxiv.org/abs/2603.06591
Authors: Runyu Peng, Ruixiao Li, Mingshu Chen, Yunhua Zhou, Qipeng Guo, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, allocate disproportionate attention, phenomenon commonly referred, Language Models
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) often allocate disproportionate attention to specific tokens, a phenomenon commonly referred to as the attention sink. While such sinks are generally considered detrimental, prior studies have identified a notable exception: the model's consistent emphasis on the first token of the input sequence. This structural bias can influence a wide range of downstream applications and warrants careful consideration. Despite its prevalence, the precise mechanisms underlying the emergence and persistence of attention sinks remain poorly understood. In this work, we trace the formation of attention sinks around the first token of the input. We identify a simple mechanism, referred to as the P0 Sink Circuit, that enables the model to recognize the token at position zero and induce an attention sink within two transformer blocks, without relying on any semantic information. This mechanism serves as the basis for the attention sink on position zero. Furthermore, by analyzing training traces from a 30B A3B MoE model trained from scratch, we find that this mechanism emerges early in training and becomes increasingly concentrated in the first two layers, suggesting a possible signal for tracking pretraining convergence states.
132. 【2603.06590】ARC-AGI-2 Technical Report
Link: https://arxiv.org/abs/2603.06590
Authors: Wallyson Lemes de Oliveira, Mekhron Bobokhonov, Matteo Caorsi, Aldo Podestà, Gabriele Beltramo, Luca Crosato, Matteo Bonotto, Federica Cecchetto, Hadrien Espic, Dan Titus Salajan, Stefan Taga, Luca Pana, Joe Carthy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: infer symbolic rules, pattern matching, Reasoning Corpus, designed to assess, infer symbolic
Comments: 59 pages
Click to view abstract
Abstract:The Abstraction and Reasoning Corpus (ARC) is designed to assess generalization beyond pattern matching, requiring models to infer symbolic rules from very few examples. In this work, we present a transformer-based system that advances ARC performance by combining neural inference with structure-aware priors and online task adaptation. Our approach is built on four key ideas. First, we reformulate ARC reasoning as a sequence modeling problem using a compact task encoding with only 125 tokens, enabling efficient long-context processing with a modified LongT5 architecture. Second, we introduce a principled augmentation framework based on group symmetries, grid traversals, and automata perturbations, enforcing invariance to representation changes. Third, we apply test-time training (TTT) with lightweight LoRA adaptation, allowing the model to specialize to each unseen task by learning its transformation logic from demonstrations. Fourth, we design a symmetry-aware decoding and scoring pipeline that aggregates likelihoods across augmented task views, effectively performing "multi-perspective reasoning" over candidate solutions. We demonstrate that these components work synergistically: augmentations expand hypothesis space, TTT sharpens local reasoning, and symmetry-based scoring improves solution consistency. Our final system achieves a significant improvement over transformer baselines and surpasses prior neural ARC solvers, closing the gap toward human-level generalization.
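To make "augmentation based on group symmetries" concrete, here is a minimal sketch that enumerates the eight rotations and reflections of a grid (the dihedral group of the square). Treating the symmetry group as dihedral is an assumption; the abstract does not say which groups the authors use, and the helper names are hypothetical.

```python
def rotate_cw(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid):
    """Flip a grid left to right."""
    return [row[::-1] for row in grid]

def dihedral_views(grid):
    """Return the 8 views of a grid: 4 rotations, each with and without a mirror."""
    views = []
    current = [list(row) for row in grid]
    for _ in range(4):
        views.append(current)
        views.append(mirror(current))
        current = rotate_cw(current)
    return views

# Usage: every view of a task grid could be scored and the scores aggregated.
views = dihedral_views([[1, 2], [3, 4]])
```

A symmetry-aware scorer would apply the same transform to the task's inputs and to each candidate output, then combine the model's likelihoods across the views.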
133. 【2603.06588】vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM
Link: https://arxiv.org/abs/2603.06588
Authors: Ching-Yun Ko, Pin-Yu Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Programming Languages (cs.PL)
Keywords: Modern artificial intelligence, optimize runtime efficiency, transformer-based large language, Modern artificial, vLLM Hook
Comments:
Click to view abstract
Abstract:Modern artificial intelligence (AI) models are deployed on inference engines to optimize runtime efficiency and resource allocation, particularly for transformer-based large language models (LLMs). The vLLM project is a major open-source library to support model serving and inference. However, the current implementation of vLLM limits programmability of the internal states of deployed models. This prevents the use of popular test-time model alignment and enhancement methods. For example, it prevents the detection of adversarial prompts based on attention patterns or the adjustment of model responses based on activation steering. To bridge this critical gap, we present vLLM Hook, an open-source plug-in to enable the programming of internal states for vLLM models. Based on a configuration file specifying which internal states to capture, vLLM Hook provides seamless integration to vLLM and supports two essential features: passive programming and active programming. For passive programming, vLLM Hook probes the selected internal states for subsequent analysis, while keeping the model generation intact. For active programming, vLLM Hook enables efficient intervention of model generation by altering the selected internal states. In addition to presenting the core functions of vLLM Hook, in version 0, we demonstrate 3 use cases including prompt injection detection, enhanced retrieval-augmented generation (RAG), and activation steering. Finally, we welcome the community's contribution to improve vLLM Hook via this https URL.
134. 【2603.08249】Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
Link: https://arxiv.org/abs/2603.08249
Authors: Pol Buitrago, Pol Gàlvez, Oriol Pareras, Javier Hernando
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Keywords: improve transcription robustness, under-resourced languages due, Audiovisual speech recognition, combines acoustic, cues to improve
Comments: 6 pages, 3 figures, Submitted to Interspeech 2026
Click to view abstract
Abstract:Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
135. 【2603.08231】Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks
Link: https://arxiv.org/abs/2603.08231
Authors: Pol Buitrago, Oriol Pareras, Federico Costa, Javier Hernando
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords: extralinguistic acoustic cues, considered relatively language-agnostic, lexical content, rely on extralinguistic, extralinguistic acoustic
Comments: 6 pages, 5 figures, Submitted to Interspeech 2026
Click to view abstract
Abstract:Paralinguistic speech tasks are often considered relatively language-agnostic, as they rely on extralinguistic acoustic cues rather than lexical content. However, prior studies report performance degradation under cross-lingual conditions, indicating non-negligible language dependence. Still, these studies typically focus on isolated language pairs or task-specific settings, limiting comparability and preventing a systematic assessment of task-level language dependence. We introduce the Cross-Lingual Transfer Matrix (CLTM), a systematic method to quantify cross-lingual interactions between pairs of languages within a given task. We apply the CLTM to two paralinguistic tasks, gender identification and speaker verification, using a multilingual HuBERT-based encoder, to analyze how donor-language data affects target-language performance during fine-tuning. Our results reveal distinct transfer patterns across tasks and languages, reflecting systematic, language-dependent effects.
136. 【2603.08216】DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
Link: https://arxiv.org/abs/2603.08216
Authors: Shangeth Rajaa
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: voice pipelines offer, offer limited support, handle turn-taking naturally, complex reasoning, voice pipelines
Comments: Submitted to Interspeech 2026
Click to view abstract
Abstract:Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
Information Retrieval
1. 【2603.08655】OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
Link: https://arxiv.org/abs/2603.08655
Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: introduce OfficeQA Pro, Treasury Bulletins spanning, OfficeQA Pro, benchmark for evaluating, large and heterogeneous
Comments: 24 pages, 16 figures. Introduces the OfficeQA Pro benchmark for grounded reasoning over enterprise documents
Click to view abstract
Abstract:We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
2. 【2603.08571】LoopLens: Supporting Search as Creation in Loop-Based Music Composition
Link: https://arxiv.org/abs/2603.08571
Authors: Sheng Long, Atsuya Kobayashi, Kei Tateno
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Sound (cs.SD)
Keywords: Creativity support tools, dance music production, electronic dance music, typically frame search, Creativity support
Comments:
Click to view abstract
Abstract:Creativity support tools (CSTs) typically frame search as information retrieval, yet in practices like electronic dance music production, search serves as a creative medium for collage-style composition. To address this gap, we present LoopLens, a research probe for loop-based music composition that visualizes audio search results to support creative foraging and assembling. We evaluated LoopLens in a within-subject user study with 16 participants of diverse musical domain expertise, performing both open-ended (divergent) and goal-directed (convergent) tasks. Our results reveal a clear behavioral split: participants with domain expertise leveraged multimodal cues to quickly exploit a narrow set of loops, while those without domain knowledge relied primarily on audio impressions, engaging in broad exploration often constrained by limited musical vocabulary for query formulation. This behavioral dichotomy provides a new lens for understanding the balance between exploration and exploitation in creative search and offers clear design implications for supporting vocabulary-independent discovery in future CSTs.
3. 【2603.08551】mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud
Link: https://arxiv.org/abs/2603.08551
Authors: Abdullah Al Masud, Shi Xintong, Mondher Bouazizi, Ohtsuki Tomoaki
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: pivotal technologies spanning, Pose estimation, human action recognition, human pose estimation, Graph Neural Network
Comments: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Click to view abstract
Abstract:Pose estimation and human action recognition (HAR) are pivotal technologies spanning various domains. While image-based pose estimation and HAR are widely admired for their superior performance, they offer weak privacy protection and perform suboptimally in low-light and dark environments. This paper exploits the capabilities of millimeter-wave (mmWave) radar technology for human pose estimation by processing radar data with a Graph Neural Network (GNN) architecture coupled with an attention mechanism. Our goal is to capture the finer details of the radar point cloud to improve pose estimation performance. To this end, we present a unique feature extraction technique that exploits the full potential of the GNN processing method for pose estimation. Our model, mmGAT, demonstrates remarkable performance on two publicly available benchmark mmWave datasets and establishes new state-of-the-art results in most scenarios in terms of human pose estimation. Our approach reduces the pose estimation mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% relative to the current state-of-the-art benchmark within this domain.
4. 【2603.08540】PCFEx: Point Cloud Feature Extraction for Graph Neural Networks
Link: https://arxiv.org/abs/2603.08540
Authors: Abdullah Al Masud, Shi Xintong, Mondher Bouazizi, Ohtsuki Tomoaki
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: gained significant attention, Graph neural networks, point cloud, neural networks, gained significant
Comments: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Click to view abstract
Abstract:Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNNs to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques to capture meaningful information at the point, edge, and graph levels of the point cloud by treating the point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four of the most popular publicly available millimeter-wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors in all three HPE benchmarks, and an overall accuracy of 98.8% in mmWave-based HAR, outperforming the existing state-of-the-art models. This work demonstrates the great potential of feature extraction combined with GNN modeling to enhance the precision of point cloud processing.
5. 【2603.08429】One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States
Link: https://arxiv.org/abs/2603.08429
Authors: Bo Jiang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: retrieve external knowledge, external knowledge typically, knowledge typically generate, separate embedding model, query as text
Comments:
Click to view abstract
Abstract:LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
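The idea of a projection head can be sketched in a few lines: a linear map takes the agent's hidden state into the document embedding space, and retrieval ranks documents by cosine similarity. This is a toy sketch with hand-set weights; the single linear layer, the cosine ranking, and all names here are assumptions, and the training losses mentioned in the abstract are not shown.

```python
import math

def project(hidden, weight, bias):
    """Linear projection head: hidden state (d_model) -> query embedding (d_embed)."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(hidden, weight, bias, doc_embeddings, k=1):
    """Rank documents by similarity to the projected hidden state; return top-k ids."""
    query = project(hidden, weight, bias)
    ranked = sorted(doc_embeddings,
                    key=lambda doc_id: cosine(query, doc_embeddings[doc_id]),
                    reverse=True)
    return ranked[:k]

# Usage with a hypothetical 3-dim hidden state and 2-dim embedding space.
top = retrieve([1.0, 0.0, 2.0], [[1, 0, 0], [0, 1, 0]], [0.0, 0.0],
               {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]})
```

Because the query vector comes straight from the hidden state, no text query has to be generated and re-encoded by a second model.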
6. 【2603.08341】ERASE -- A Real-World Aligned Benchmark for Unlearning in Recommender Systems
Link: https://arxiv.org/abs/2603.08341
Authors: Pierre Lubitzsch, Maarten de Rijke, Sebastian Schelter
Subjects: Information Retrieval (cs.IR)
Keywords: address privacy compliance, selected training data, Machine unlearning, recommender systems, privacy compliance
Comments:
Click to view abstract
Abstract:Machine unlearning (MU) enables the removal of selected training data from trained models, to address privacy compliance, security, and liability issues in recommender systems. Existing MU benchmarks poorly reflect real-world recommender settings: they focus primarily on collaborative filtering, assume unrealistically large deletion requests, and overlook practical constraints such as sequential unlearning and efficiency. We present ERASE, a large-scale benchmark for MU in recommender systems designed to align with real-world usage. ERASE spans three core tasks -- collaborative filtering, session-based recommendation, and next-basket recommendation -- and includes unlearning scenarios inspired by real-world applications, such as sequentially removing sensitive interactions or spam. The benchmark covers seven unlearning algorithms, including general-purpose and recommender-specific methods, across nine public datasets and nine state-of-the-art models. We execute ERASE to produce more than 600 GB of reusable artifacts, such as extensive experimental logs and more than a thousand model checkpoints. Crucially, the artifacts that we release enable systematic analysis of where current unlearning methods succeed and where they fall short. ERASE showcases that approximate unlearning can match retraining in some settings, but robustness varies widely across datasets and architectures. Repeated unlearning exposes weaknesses in general-purpose methods, especially for attention-based and recurrent models, while recommender-specific approaches behave more reliably. ERASE provides the empirical foundation to help the community assess, drive, and track progress toward practical MU in recommender systems.
7. 【2603.08329】SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2603.08329
Authors: Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda Akpinar
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: requires synthesizing facts, synthesizing facts scattered, real-world queries, queries often requires, requires synthesizing
Comments: 12 pages
Click to view abstract
Abstract:Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).
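The token-bounded synthesis layer with recursive map-reduce can be pictured roughly as follows: partial answers are packed into batches that fit a token budget, each batch is merged, and the merged results are fed back in until one answer remains. This is a sketch under stated assumptions, not the paper's implementation: `merge` stands in for an LLM call, whitespace splitting stands in for a tokenizer, and every name is hypothetical.

```python
def count_tokens(text):
    """Crude token count by whitespace; a real system would use a tokenizer."""
    return len(text.split())

def synthesize(partials, merge, budget):
    """Recursively merge partial answers in batches of at most `budget` tokens."""
    if len(partials) == 1:
        return partials[0]
    batches, current, used = [], [], 0
    for partial in partials:
        size = count_tokens(partial)
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(partial)
        used += size
    if current:
        batches.append(current)
    # Stop when one batch holds everything, or when batching no longer
    # groups anything (guards against infinite recursion).
    if len(batches) == 1 or len(batches) == len(partials):
        return merge(partials)
    return synthesize([merge(batch) for batch in batches], merge, budget)

# Usage with a placeholder merge that simply concatenates partial answers.
answer = synthesize(["a b", "c d", "e f"], lambda batch: " ".join(batch), budget=4)
```

With a summarizing `merge`, each level shrinks the text, so a very large set of per-document answers can be reduced in a few rounds without any single call exceeding the budget.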
8. 【2603.08117】UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Link: https://arxiv.org/abs/2603.08117
Authors: Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Recent advancements, achieved record-breaking performance, Unindexed Information Seeking, advancements in LLM-based, achieved record-breaking
Comments: 21 pages, 5 figures, ICLR 2026
Click to view abstract
Abstract:Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small ~30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems.
9. 【2603.08077】Why Large Language Models can Secretly Outperform Embedding Similarity in Information Retrieval
Link: https://arxiv.org/abs/2603.08077
Authors: Matei Benescu, Ivo Pascal de Jong
Subjects: Information Retrieval (cs.IR)
Keywords: Large Language Models, Language Models, Large Language, Embedding Retrieval Systems, Neural Embedding Retrieval
Comments: 13 pages, 6 figures, 5 tables
Click to view abstract
Abstract:With the emergence of Large Language Models (LLMs), new methods in Information Retrieval are available in which relevance is estimated directly through language understanding and reasoning, instead of embedding similarity. We argue that similarity is a short-sighted interpretation of relevance, and that LLM-Based Relevance Judgment Systems (LLM-RJS) (with reasoning) have potential to outperform Neural Embedding Retrieval Systems (NERS) by overcoming this limitation. Using the TREC-DL 2019 passage retrieval dataset, we compare various LLM-RJS with NERS, but observe no noticeable improvement. Subsequently, we analyze the impact of reasoning by comparing LLM-RJS with and without reasoning. We find that human annotations also suffer from short-sightedness, and that false-positives in the reasoning LLM-RJS are primarily mistakes in annotations due to short-sightedness. We conclude that LLM-RJS do have the ability to address the short-sightedness limitation in NERS, but that this cannot be evaluated with standard annotated relevance datasets.
10. 【2603.08012】Structure-Preserving Graph Contrastive Learning for Mathematical Information Retrieval
Link: https://arxiv.org/abs/2603.08012
Authors: Chun-Hsi Ku, Hung-Hsuan Chen
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Keywords: introduces Variable Substitution, paper introduces Variable, graph contrastive learning, GCL augmentation techniques, Variable Substitution
Comments:
Abstract:This paper introduces Variable Substitution as a domain-specific graph augmentation technique for graph contrastive learning (GCL) in the context of searching for mathematical formulas. Standard GCL augmentation techniques often distort the semantic meaning of mathematical formulas, particularly for small and highly structured graphs. Variable Substitution, on the other hand, preserves the core algebraic relationships and formula structure. To demonstrate the effectiveness of our technique, we apply it to a classic GCL-based retrieval model. Experiments show that this straightforward approach significantly improves retrieval performance compared to generic augmentation strategies. We release the code on GitHub (this https URL).
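To make the idea of a structure-preserving augmentation concrete, here is a minimal sketch of consistent variable renaming on a formula tree. The nested-tuple representation, the function name `substitute_variables`, and the replacement pool are illustrative assumptions, not the authors' implementation; the abstract does not specify how formulas are encoded as graphs.

```python
import random

def substitute_variables(tree, variables, pool, rng):
    """Rename each variable consistently using distinct names drawn from pool.

    Assumption: formulas are nested tuples of the form (operator, operand, ...).
    Operators and constants are left untouched, so the structure is preserved.
    """
    names = sorted(variables)
    # rng.sample draws without replacement, so two variables never collide.
    mapping = dict(zip(names, rng.sample(pool, len(names))))

    def walk(node):
        if isinstance(node, tuple):
            return tuple(walk(child) for child in node)
        return mapping.get(node, node)

    return walk(tree), mapping

# x^2 + 2*x*y written as a nested tuple (hypothetical example formula).
formula = ("+", ("^", "x", "2"), ("*", "2", "x", "y"))
view, mapping = substitute_variables(
    formula, {"x", "y"}, ["a", "b", "u", "v"], random.Random(0)
)
```

The original formula and the renamed `view` could then serve as a positive pair in a contrastive objective, since every occurrence of a variable is replaced by the same new symbol.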
11. 【2603.07853】SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
Link: https://arxiv.org/abs/2603.07853
Authors: Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Research Agents enable, answer user queries, dynamically interleave internal, interleave internal reasoning, Agents enable models
Comments:
Abstract:Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories that encourage deeper exploration to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones respectively compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at this https URL.
12. 【2603.07725】Verifiable Reasoning for LLM-based Generative Recommendation
Link: https://arxiv.org/abs/2603.07725
Authors: Xinyu Lin, Hanqing Zeng, Hanchao Yu, Yinglong Xia, Jiang Zhang, Aashu Singh, Fei Liu, Wenjie Wang, Fuli Feng, Tat-Seng Chua, Qifan Wang
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Language Models, Large Language, recently shown strong, shown strong potential
Comments:
Abstract:Reasoning in Large Language Models (LLMs) has recently shown strong potential in enhancing generative recommendation through deep understanding of complex user preference. Existing approaches follow a reason-then-recommend paradigm, where LLMs perform step-by-step reasoning before item generation. However, this paradigm inevitably suffers from reasoning degradation (i.e., homogeneous or error-accumulated reasoning) due to the lack of intermediate verification, thus undermining the recommendation. To bridge this gap, we propose a novel reason-verify-recommend paradigm, which interleaves reasoning with verification to provide reliable feedback, guiding the reasoning process toward more faithful user preference understanding. To enable effective verification, we establish two key principles for verifier design: 1) reliability ensures accurate evaluation of reasoning correctness and informative guidance generation; and 2) multi-dimensionality emphasizes comprehensive verification across multi-dimensional user preferences. Accordingly, we propose an effective implementation called VRec. It employs a mixture of verifiers to ensure multi-dimensionality, while leveraging a proxy prediction objective to pursue reliability. Experiments on four real-world datasets demonstrate that VRec substantially enhances recommendation effectiveness and scalability without compromising efficiency. The codes can be found at this https URL.
13. 【2603.07605】Deep Research for Recommender Systems
Link: https://arxiv.org/abs/2603.07605
Authors: Kesha Ou, Chenghao Wu, Xiaolei Wang, Bowen Zheng, Wayne Xin Zhao, Weitao Li, Long Zhang, Sheng Chen, Ji-Rong Wen
Categories: Information Retrieval (cs.IR)
Keywords: large language models, complex neural models, large language, neural models, language models
Comments: 24 pages, 5 figures, 5 tables
Abstract:The technical foundations of recommender systems have progressed from collaborative filtering to complex neural models and, more recently, large language models. Despite these technological advances, deployed systems often underserve their users by simply presenting a list of items, leaving the burden of exploration, comparison, and synthesis entirely on the user. This paper argues that this traditional "tool-based" paradigm fundamentally limits user experience, as the system acts as a passive filter rather than an active assistant. To address this limitation, we propose a novel deep research paradigm for recommendation, which replaces conventional item lists with comprehensive, user-centric reports. We instantiate this paradigm through RecPilot, a multi-agent framework comprising two core components: a user trajectory simulation agent that autonomously explores the item space, and a self-evolving report generation agent that synthesizes the findings into a coherent, interpretable report tailored to support user decisions. This approach reframes recommendation as a proactive, agent-driven service. Extensive experiments on public datasets demonstrate that RecPilot not only achieves strong performance in modeling user behaviors but also generates highly persuasive reports that substantially reduce user effort in item evaluation, validating the potential of this new interaction paradigm.
14. 【2603.07517】GP-Tree: An in-memory spatial index combining adaptive grid cells with a prefix tree for efficient spatial querying
Link: https://arxiv.org/abs/2603.07517
Authors: Xiangyang Yang, Xuefeng Guan, Lanxue Dang, Yi Xie, Qingyang Xu, Huayi Wu, Jiayao Wang
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Efficient spatial indexing, processing large-scale spatial, Efficient spatial, spatial, spatial indexes
Comments:
Abstract:Efficient spatial indexing is crucial for processing large-scale spatial data. Traditional spatial indexes, such as STR-Tree and Quad-Tree, organize spatial objects based on coarse approximations, such as their minimum bounding rectangles (MBRs). However, this coarse representation is inadequate for complex spatial objects (e.g., district boundaries and trajectories), limiting filtering accuracy and query performance of spatial indexes. To address these limitations, we propose GP-Tree, a fine-grained spatial index that organizes approximated grid cells of spatial objects into a prefix tree structure. GP-Tree enhances filtering ability by replacing coarse MBRs with fine-grained cell-based approximations of spatial objects. The prefix tree structure optimizes data organization and query efficiency by leveraging the shared prefixes in the hierarchical grid cell encodings between parent and child cells. Additionally, we introduce optimization strategies, including tree pruning and node optimization, to reduce search paths and memory consumption, further enhancing GP-Tree's performance. Finally, we implement a variety of spatial query operations on GP-Tree, including range queries, distance queries, and k-nearest neighbor queries. Extensive experiments on real-world datasets demonstrate that GP-Tree significantly outperforms traditional spatial indexes, achieving up to an order-of-magnitude improvement in query efficiency.
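The shared-prefix property mentioned in the abstract can be illustrated with a small sketch. It assumes a quadtree-style encoding in which each cell code is a string of digits 0-3 and a child cell's code extends its parent's code by one digit; the class name `CellTrie` and this encoding are assumptions for illustration only, since the abstract does not describe GP-Tree's actual encoding or node layout.

```python
class CellTrie:
    """Prefix tree over hierarchical grid-cell codes (illustrative sketch)."""

    def __init__(self):
        self.children = {}    # next code digit -> child node
        self.objects = set()  # ids of objects approximated by this exact cell

    def insert(self, cell_code, obj_id):
        node = self
        for digit in cell_code:
            node = node.children.setdefault(digit, CellTrie())
        node.objects.add(obj_id)

    def candidates(self, cell_code):
        """Objects stored in any cell that contains, or lies inside, cell_code."""
        found = set()
        node = self
        for digit in cell_code:
            found |= node.objects  # coarser cells that contain the query cell
            node = node.children.get(digit)
            if node is None:
                return found
        # node is now the query cell: collect it and every finer cell below it.
        stack = [node]
        while stack:
            current = stack.pop()
            found |= current.objects
            stack.extend(current.children.values())
        return found

# Hypothetical objects and cell codes, not data from the paper.
index = CellTrie()
index.insert("0", "district_A")
index.insert("012", "road_B")
index.insert("013", "road_C")
index.insert("31", "park_D")
hits = index.candidates("01")
```

In this toy index, querying cell `01` returns the district stored at the coarser cell `0` plus both roads stored in finer cells under `01`, while `park_D` is filtered out without being touched.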
15. 【2603.07502】SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration
Link: https://arxiv.org/abs/2603.07502
Authors: Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu, Guisheng Fan
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: cross-source data discovery, posing significant challenges, open data platforms, fragmented dataset ecosystem, open data
Comments: 16 pages, 8 figures. System for large-scale dataset discovery and multi-entity semantic exploration
Abstract:The continuous expansion of open data platforms and research repositories has led to a fragmented dataset ecosystem, posing significant challenges for cross-source data discovery and interpretation. To address these challenges, we introduce SeDa, a unified framework for dataset discovery, semantic annotation, and multi-entity augmented navigation. SeDa integrates more than 7.6 million datasets from over 200 platforms, spanning governmental, academic, and industrial domains. The framework first performs semantic extraction and standardization to harmonize heterogeneous metadata representations. On this basis, a topic-tagging mechanism constructs an extensible tag graph that supports thematic retrieval and cross-domain association, while a provenance assurance module embedded within the annotation process continuously validates dataset sources and monitors link availability to ensure reliability and traceability. Furthermore, SeDa employs a multi-entity augmented navigation strategy that organizes datasets within a knowledge space of sites, institutions, and enterprises, enabling contextual and provenance-aware exploration beyond traditional search paradigms. Comparative experiments with popular dataset search platforms, such as ChatPD and Google Dataset Search, demonstrate that SeDa achieves superior coverage, timeliness, and traceability. Taken together, SeDa establishes a foundation for trustworthy, semantically enriched, and globally scalable dataset exploration.
16. 【2603.07449】Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Link: https://arxiv.org/abs/2603.07449
Authors: Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Enterprises commonly deploy, distinct SQL dialect, commonly deploy heterogeneous, Enterprises commonly, deploy heterogeneous database
Comments:
Abstract:Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference. In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at this https URL.
17. 【2603.07379】SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Link: https://arxiv.org/abs/2603.07379
Authors: Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: coordinate multi-step reasoning, large language models, language models autonomously, models autonomously coordinate, autonomously coordinate multi-step
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies. Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved reliability risks. This Systematization of Knowledge (SoK) paper provides the first unified framework for understanding these autonomous systems. We formalize agentic retrieval-generation loops as finite-horizon partially observable Markov decision processes, explicitly modeling their control policies and state transitions. Building upon this formalization, we develop a comprehensive taxonomy and modular architectural decomposition that categorizes systems by their planning mechanisms, retrieval orchestration, memory paradigms, and tool-invocation behaviors. We further analyze the critical limitations of traditional static evaluation practices and identify severe systemic risks inherent to autonomous loops, including compounding hallucination propagation, memory poisoning, retrieval misalignment, and cascading tool-execution vulnerabilities. Finally, we outline key doctoral-scale research directions spanning stable adaptive retrieval, cost-aware orchestration, formal trajectory evaluation, and oversight mechanisms, providing a definitive roadmap for building reliable, controllable, and scalable agentic retrieval systems.
18. 【2603.07287】Do Deployment Constraints Make LLMs Hallucinate Citations? An Empirical Study across Four Models and Five Prompting Regimes
Link: https://arxiv.org/abs/2603.07287
Authors: Chen Zhao, Yuan Tang, Yitian Qian
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Keywords: support software engineering, draft academic text, hallucinate bibliographic references, evidence synthesis, software engineering
Comments:
Abstract:LLMs are increasingly used to draft academic text and to support software engineering (SE) evidence synthesis, but they often hallucinate bibliographic references that look legitimate. We study how deployment-motivated prompting constraints affect citation verifiability in a closed-book setting. Using 144 claims (24 in SECS) and a deterministic verification pipeline (Crossref + Semantic Scholar), we evaluate two proprietary models (Claude Sonnet, GPT-4o) and two open-weight models (LLaMA 3.1-8B, Qwen 2.5-14B) across five regimes: Baseline, Temporal (publication-year window), Survey-style breadth, Non-Disclosure policy, and their combination. Across 17,443 generated citations, no model exceeds a citation-level existence rate of 0.475; Temporal and Combo conditions produce the steepest drops while outputs remain format-compliant (well-formed bibliographic fields). Unresolved outcomes dominate (36-61%); a 100-citation audit indicates that a substantial fraction of Unresolved cases are fabricated. Results motivate post-hoc citation verification before LLM outputs enter SE literature reviews or tooling pipelines.
19. 【2603.07271】AutoDataset: A Lightweight System for Continuous Dataset Discovery and Search
Link: https://arxiv.org/abs/2603.07271
Authors: Junzhe Yang, Xinghao Chen, Yunuo Liu, Zhijing Sun, Wenjin Guo, Xiaoyu Shen
Categories: Information Retrieval (cs.IR)
Keywords: machine learning, continuous expansion, expansion of task-specific, major driver, driver of progress
Comments: 10 pages, 4 figures. Source code is available at [this https URL](https://github.com/EIT-NLP/AutoDataset). Screencast video: [this https URL](https://youtu.be/_QSxHKyIYns)
Abstract:The continuous expansion of task-specific datasets has become a major driver of progress in machine learning. However, discovering newly released datasets remains difficult, as existing platforms largely depend on manual curation or community submissions, leading to limited coverage and substantial delays. To address this challenge, we introduce AutoDataset, a lightweight, automated system for real-time dataset discovery and retrieval. AutoDataset adopts a paper-first approach by continuously monitoring arXiv to detect and index datasets directly from newly published research. The system operates through a low-overhead multi-stage pipeline. First, a lightweight classifier rapidly filters titles and abstracts to identify papers releasing datasets, achieving an F1 score of 0.94 with an inference latency of 11 ms. For identified papers, we parse PDFs with GROBID and apply a sentence-level extractor to extract dataset descriptions. Dataset URLs are extracted from the paper text with an automated fallback to LaTeX source analysis when needed. Finally, the structured records are indexed using a dense semantic retriever, enabling low-latency natural language search. We deploy AutoDataset as a live system that continuously ingests new papers and provides up-to-date dataset discovery. In practice, it has been shown to significantly reduce the time required for researchers to locate newly released datasets, improving dataset discovery efficiency by up to 80%.
20. 【2603.07241】Rethinking Deep Research from the Perspective of Web Content Distribution Matching
Link: https://arxiv.org/abs/2603.07241
Authors: Zixuan Yu, Zhenheng Tang, Tongliang Liu, Chengqi Zhang, Xiaowen Chu, Bo Han
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: Deep Search Agents, Deep Search, web indexing structures, underlying web indexing, search tools
Comments:
Abstract:Despite the integration of search tools, Deep Search Agents often suffer from a misalignment between reasoning-driven queries and the underlying web indexing structures. Existing frameworks treat the search engine as a static utility, leading to queries that are either too coarse or too granular to retrieve precise evidence. We propose WeDas, a Web Content Distribution Aware framework that incorporates search-space structural characteristics into the agent's observation space. Central to our method is the Query-Result Alignment Score, a metric quantifying the compatibility between agent intent and retrieval outcomes. To overcome the intractability of indexing the dynamic web, we introduce a few-shot probing mechanism that iteratively estimates this score via limited query accesses, allowing the agent to dynamically recalibrate sub-goals based on the local content landscape. As a plug-and-play module, WeDas consistently improves sub-goal completion and accuracy across four benchmarks, effectively bridging the gap between high-level reasoning and low-level retrieval.
21. 【2603.07233】Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation
Link: https://arxiv.org/abs/2603.07233
Authors: Andrea Giuseppe Di Francesco, Andrea Rubbi, Pietro Liò
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: understanding gene function, disease mechanisms, therapeutic development, respond to genetic, fundamental to understanding
Comments: Accepted at ICLR 2026 Workshop: Generative AI in Genomics. 25 pages, 9 figures
Abstract:Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single-cell perturbation responses, they struggle to generalize across cell types and perturbation contexts due to limited contextual information during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a novel framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre-trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT-RAG addresses this through a two-stage pipeline: first, retrieving candidate perturbations $K$ using GenePT embeddings, then adaptively refining the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, we demonstrate that PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics ($W_1$, $W_2$). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modelling cellular responses to gene perturbation. The code to reproduce our experiments is available at this https URL.
22. 【2603.07204】Detecting Cryptographically Relevant Software Packages with Collaborative LLMs
Link: https://arxiv.org/abs/2603.07204
Authors: Eduard Hirsch, Kristina Raab, Tobias J. Bauer, Daniel Loebenberger
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: including advanced persistent, advanced persistent attacks, future quantum-computing vulnerabilities, security threats, including advanced
Comments: published at ICISSP ([this https URL](https://icissp.scitevents.org/))
Abstract:IT systems are facing an increasing number of security threats, including advanced persistent attacks and future quantum-computing vulnerabilities. The move towards crypto-agility and post-quantum cryptography (PQC) requires a reliable inventory of cryptographic assets across heterogeneous IT environments. Due to the sheer number of packages, it is infeasible to manually detect cryptographically relevant software. Further, static code analysis pipelines often fail to address the diversity of modern ecosystems. Our research explores the use of large language models (LLMs) as heuristic tools for cryptographic asset discovery. We propose a collaborative framework that employs multiple LLMs to assess software relevance and aggregates their outputs through majority voting. To preserve data privacy, the approach operates on-premises without reliance on external servers. Using over 65,000 Fedora Linux packages, we evaluate the reliability of this method through statistical analysis, inter-model agreement, and manual validation. Preliminary results suggest that LLM ensembles can serve as an efficient first-pass filter for identifying cryptographic software, resulting in reduced manual workload and assisting PQC transition. The study also compares on-premises and online LLM configurations, highlighting key advantages, limitations, and future directions for automated cryptographic asset discovery.
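The aggregation step described above (several LLM verdicts combined by majority voting) can be sketched in a few lines. The label names, the tie-handling rule, and the sample verdicts below are assumptions made for illustration; the abstract does not state how ties or abstentions are treated.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most models.

    Assumption: an exact tie between the top two labels is reported as
    'uncertain' so that it can be routed to manual review.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "uncertain"
    return ranked[0][0]

# Hypothetical per-model verdicts for three packages (not data from the paper).
verdicts = {
    "openssl": ["crypto", "crypto", "crypto"],
    "vim": ["not_crypto", "not_crypto", "crypto"],
    "curl": ["crypto", "not_crypto"],
}
decisions = {package: majority_vote(v) for package, v in verdicts.items()}
```

Using an odd number of models avoids most ties; the explicit `uncertain` outcome keeps the ensemble usable as a first-pass filter even when the models disagree evenly.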
23. 【2603.07179】Retrieving Minimal and Sufficient Reasoning Subgraphs with Graph Foundation Models for Path-aware GraphRAG
Link: https://arxiv.org/abs/2603.07179
Authors: Haonan Yuan, Qingyun Sun, Junhua Shi, Mingjun Liu, Jiaqi Yuan, Ziwei Zhang, Xingcheng Fu, Jianxin Li
Categories: Information Retrieval (cs.IR)
Keywords: exploits structured knowledge, Graph-based retrieval-augmented generation, Graph-based retrieval-augmented, exploits structured, structured knowledge
Comments:
Abstract:Graph-based retrieval-augmented generation (GraphRAG) exploits structured knowledge to support knowledge-intensive reasoning. However, most existing methods treat graphs as intermediate artifacts, and the few subgraph-based retrieval methods depend on heuristic rules coupled with domain-specific distributions. They fail in typical cold-start scenarios where data in target domains is scarce, thus yielding reasoning contexts that are either informationally incomplete or structurally redundant. In this work, we revisit retrieval from a structural perspective, and propose GFM-Retriever that directly responds to user queries with a subgraph, where a pre-trained Graph Foundation Model acts as a cross-domain Retriever for multi-hop path-aware reasoning. Building on this perspective, we repurpose a pre-trained GFM from an entity ranking function into a generalized retriever to support cross-domain retrieval. On top of the retrieved graph, we further derive a label-free subgraph selector optimized by a principled Information Bottleneck objective to identify the query-conditioned subgraph, which contains informationally sufficient and structurally minimal golden evidence in a self-contained "core set". To connect structure with generation, we explicitly extract and reorganize relational paths as in-context prompts, enabling interpretable reasoning. Extensive experiments on multi-hop question answering benchmarks demonstrate that GFM-Retriever achieves state-of-the-art performance in both retrieval quality and answer generation, while maintaining efficiency.
24. 【2603.07146】Fine-Grained Table Retrieval Through the Lens of Complex Queries
Link: https://arxiv.org/abs/2603.07146
Authors: Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma Özcan, Madelon Hulsebos
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Keywords: Enabling question answering, tabular data sources, Enabling question, natural language, natural language query
Comments:
Abstract:Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced. In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity which we measure along the axes of query- and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
25. 【2603.07107】Efficient Personalized Reranking with Semi-Autoregressive Generation and Online Knowledge Distillation
Link: https://arxiv.org/abs/2603.07107
Authors: Kai Cheng, Hao Wang, Wei Guo, Weiwen Liu, Yong Liu, Yawen Li, Enhong Chen
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: multi-stage recommender systems, Generative models offer, capture inter-item dependencies, final stage reranking, recommender systems
Comments:
Abstract:Generative models offer a promising paradigm for the final stage reranking in multi-stage recommender systems, with the ability to capture inter-item dependencies within reranked lists. However, their practical deployment still faces two key challenges: (1) an inherent conflict between achieving high generation quality and ensuring low-latency inference, making it difficult to balance the two, and (2) insufficient interaction between user and item features in existing methods. To address these challenges, we propose a novel Personalized Semi-Autoregressive with online knowledge Distillation (PSAD) framework for reranking. In this framework, the teacher model adopts a semi-autoregressive generator to balance generation quality and efficiency, while its ranking knowledge is distilled online into a lightweight scoring network during joint training, enabling real-time and efficient inference. Furthermore, we propose a User Profile Network (UPN) that injects user intent and models interest dynamics, enabling deeper interactions between users and items. Extensive experiments conducted on three large-scale public datasets demonstrate that PSAD significantly outperforms state-of-the-art baselines in both ranking performance and inference efficiency.
26. 【2603.07086】Multi-TAP: Multi-criteria Target Adaptive Persona Modeling for Cross-Domain Recommendation
Link: https://arxiv.org/abs/2603.07086
Authors: Daehee Kang, Yeon-Chang Lee
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Keywords: alleviate data sparsity, coarse-grained behavioral signals, existing methods primarily, methods primarily rely, aims to alleviate
Comments:
Abstract:Cross-domain recommendation (CDR) aims to alleviate data sparsity by transferring knowledge across domains, yet existing methods primarily rely on coarse-grained behavioral signals and often overlook intra-domain heterogeneity in user preferences. We propose Multi-TAP, a multi-criteria target-adaptive persona framework that explicitly captures such heterogeneity through semantic persona modeling. To enable effective transfer, Multi-TAP selectively incorporates source-domain signals conditioned on the target domain, preserving relevance during knowledge transfer. Experiments on real-world datasets demonstrate that Multi-TAP consistently outperforms state-of-the-art CDR methods, highlighting the importance of modeling intra-domain heterogeneity for robust cross-domain recommendation. The codebase of Multi-TAP is currently available at this https URL.
27. 【2603.07050】Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases
Link: https://arxiv.org/abs/2603.07050
Authors: Nikita Gautam, Doina Caragea, Ignacio Ciampitti, Federico Gomez
Categories: Information Retrieval (cs.IR)
Keywords: online scientific literature, Large Language Models, scientific databases, domain-specific scientific literature, open scientific databases
Comments:
Abstract:With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manual data collection and filtering for domain-specific scientific literature is not only time-consuming but also labor-intensive and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on an automated and unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique to construct a combined unified dataset. The dataset is subsequently filtered using LLMs queried with prompts tailored for each keyword-based query to extract the relevant data to a scientific query of interest. The approach was tested across a set of variable keyword-based searches for different domain-specific tasks related to agriculture and crop yield. The results and analysis show 90% overlap with small domain expert-curated databases, suggesting that the proposed tool can be used to significantly reduce manual workload. Furthermore, the proposed framework is both scalable and domain-agnostic and can be applied across diverse fields for building scalable open scientific databases.
28. 【2603.06982】Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
Link: https://arxiv.org/abs/2603.06982
Authors: Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Image-based shape retrieval, Image-based shape, computer vision, computer graphics, aims to retrieve
Notes:
Abstract:Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image--point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training on our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.
29. 【2603.06660】Approximate Nearest Neighbor Search for Modern AI: A Projection-Augmented Graph Approach
Link: https://arxiv.org/abs/2603.06660
Authors: Kejing Lu, Zhenpeng Pan, Jianbin Qin, Yoshiharu Ishikawa, Chuan Xiao
Categories: Information Retrieval (cs.IR); Databases (cs.DB); Machine Learning (cs.LG)
Keywords: Nearest Neighbor Search, Approximate Nearest Neighbor, Neighbor Search, Nearest Neighbor, Approximate Nearest
Notes: Source code is available at [this https URL](https://github.com/KejingLu-810/PAG/)
Abstract:Approximate Nearest Neighbor Search (ANNS) is fundamental to modern AI applications. Most existing solutions optimize query efficiency but fail to align with the practical requirements of modern workloads. In this paper, we outline six critical demands of modern AI applications: high query efficiency, fast indexing, low memory footprint, scalability to high dimensionality, robustness across varying retrieval sizes, and support for online insertions. To satisfy all these demands, we introduce Projection-Augmented Graph (PAG), a new ANNS framework that integrates projection techniques into a graph index. PAG reduces unnecessary exact distance computations through asymmetric comparisons between exact and approximate distances as guided by projection-based statistical tests. Three key components are designed and unified to the graph index to optimize indexing and searching. Experiments on six modern datasets demonstrate that PAG consistently achieves superior query per second (QPS)-recall performance -- up to 5x faster than HNSW -- while offering fast indexing speed and moderate memory footprint. PAG remains robust as dimensionality and retrieval size increase and naturally supports online insertions.
30. 【2603.06631】T-REX: Transformer-Based Category Sequence Generation for Grocery Basket Recommendation
Link: https://arxiv.org/abs/2603.06631
Authors: Soroush Mokhtari, Muhammad Tayyab Asif, Sergiy Zubatiy
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: shopping presents unique, complex item relationships, presents unique challenges, repetitive purchase patterns, presents unique
Notes:
Abstract:Online grocery shopping presents unique challenges for sequential recommendations due to repetitive purchase patterns and complex item relationships within the baskets. Unlike traditional e-commerce, grocery recommendations must capture both complementary item associations and temporal dependencies across shopping sessions. To address these challenges in Amazon's online grocery business, we propose T-REX, a novel transformer architecture that generates personalized category-level suggestions by learning both short-term basket dependencies and long-term user preferences. Our approach introduces three key innovations: (1) an efficient sampling strategy utilizing dynamic sequence splitting for sparse shopping patterns, (2) an adaptive positional encoding scheme for temporal patterns, and (3) a category-level modeling approach that reduces dimensionality while maintaining recommendation quality. Although masked language modeling techniques like BERT4Rec excel at capturing item relations, they prove less suitable for next basket generation due to information leakage issues. In contrast, T-REX's causal masking approach better aligns with the sequential nature of basket generation, enabling more accurate next-basket predictions. Experiments on large-scale grocery offline data and online A/B tests show significant improvement over existing systems.
31. 【2603.06624】Exploration Space Theory: Formal Foundations for Prerequisite-Aware Location-Based Recommendation
Link: https://arxiv.org/abs/2603.06624
Authors: Madjid Sadallah
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Keywords: achieved considerable sophistication, presupposes contextual knowledge, contextual knowledge gained, locations presupposes contextual, Knowledge Space Theory
Notes: Pre-print of a theoretical framework for prerequisite-aware recommendation using Knowledge Space Theory and Birkhoff representation
Abstract:Location-based recommender systems have achieved considerable sophistication, yet none provides a formal, lattice-theoretic representation of prerequisite dependencies among points of interest -- the semantic reality that meaningfully experiencing certain locations presupposes contextual knowledge gained from others -- nor the structural guarantees that such a representation entails. We introduce Exploration Space Theory (EST), a formal framework that transposes Knowledge Space Theory into location-based recommendation. We prove that the valid user exploration states -- the order ideals of a surmise partial order on points of interest -- form a finite distributive lattice and a well-graded learning space; Birkhoff's representation theorem, combined with the structural isomorphism between lattices of order ideals and concept lattices, connects the exploration space canonically to Formal Concept Analysis. These structural results yield four direct consequences: linear-time fringe computation, a validity certificate guaranteeing that every fringe-guided recommendation is a structurally sound next step, sub-path optimality for dynamic-programming path generation, and provably existing structural explanations for every recommendation. Building on these foundations, we specify the Exploration Space Recommender System (ESRS) -- a memoized dynamic program over the exploration lattice, a Bayesian state estimator with beam approximation and EM parameter learning, an online feedback loop enforcing the downward-closure invariant, an incremental surmise-relation inference pipeline, and three cold-start strategies, the structural one being the only approach in the literature to provide a formal validity guarantee conditional on the correctness of the inferred surmise relation. All results are established through proof and illustrated on a fully traced five-POI numerical example.
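Editor's sketch: the "fringe" idea in this abstract can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the paper's ESRS. The surmise relation is assumed to be given as a hypothetical map `PREREQS` from each point of interest (POI) to its direct prerequisites, and the POI names are invented for illustration. A valid exploration state is a downward-closed set of POIs, and the outer fringe is the set of unvisited POIs whose prerequisites have all been visited.

```python
from typing import Dict, Set

# Hypothetical surmise relation: POI -> direct prerequisites (names are illustrative).
PREREQS: Dict[str, Set[str]] = {
    "museum": set(),
    "old_town": set(),
    "cathedral": {"old_town"},
    "crypt": {"cathedral", "museum"},
    "viewpoint": {"old_town"},
}


def is_valid_state(state: Set[str], prereqs: Dict[str, Set[str]]) -> bool:
    """Check downward closure: every visited POI has all direct prerequisites visited."""
    return all(prereqs[poi] <= state for poi in state)


def outer_fringe(state: Set[str], prereqs: Dict[str, Set[str]]) -> Set[str]:
    """Unvisited POIs whose direct prerequisites are all contained in the state."""
    return {poi for poi in prereqs if poi not in state and prereqs[poi] <= state}


if __name__ == "__main__":
    state = {"old_town"}
    print(sorted(outer_fringe(state, PREREQS)))  # ['cathedral', 'museum', 'viewpoint']
```

Adding any fringe element to a valid state yields another valid state, which is the intuition behind recommending only from the fringe.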
32. 【2603.06589】Isotonic Layer: A Universal Framework for Generic Recommendation Debiasing
Link: https://arxiv.org/abs/2603.06589
Authors: Hailing Cheng, Yafang Yang, Hemeng Tao, Fengyu Zhang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: scale recommendation systems, large scale recommendation, debiasing are fundamental, reliability and fairness, fairness of large
Notes: 8 pages, 5 figures, submitted to KDD 2026
Abstract:Model calibration and debiasing are fundamental to the reliability and fairness of large scale recommendation systems. We introduce the Isotonic Layer, a novel, differentiable framework that integrates piecewise linear fitting directly into neural architectures. By partitioning the feature space into discrete segments and optimizing non negative slopes via a constrained dot product mechanism, we enforce a global monotonic inductive bias. This ensures model outputs remain logically consistent with critical features such as latent relevance, recency, or quality scores. We further generalize this architecture by parameterizing segment wise slopes as learnable embeddings. This enables the model to adaptively capture context specific distortions, such as position based CTR bias through specialized isotonic profiles. Our approach utilizes a dual task formulation that decouples the recommendation objective into latent relevance estimation and bias aware calibration. A major contribution of this work is the ability to perform highly granular, customized calibration for arbitrary combinations of context features, a level of control difficult to achieve with traditional non parametric methods. We also extend this to Multi Task Learning environments with dedicated embeddings for distinct objectives. Extensive empirical evaluations on real world datasets and production AB tests demonstrate that the Isotonic Layer effectively mitigates systematic bias and enhances calibration fidelity, significantly outperforming production baselines in both predictive accuracy and ranking consistency.
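Editor's sketch: the abstract describes a piecewise linear fit with non-negative slopes combined through a dot product. One way to picture this is shown below. It is only a sketch under assumptions, not the authors' layer: the segment edges, the raw weights, and the use of softplus to keep slopes non-negative are all illustrative choices that the abstract does not specify.

```python
import math
from typing import List, Sequence


def softplus(w: float) -> float:
    """Map an unconstrained weight to a strictly positive slope (assumed constraint)."""
    return math.log1p(math.exp(w))


def segment_activations(x: float, edges: Sequence[float]) -> List[float]:
    """For each segment [lo, hi], return how much of it lies below x."""
    return [min(max(x - lo, 0.0), hi - lo) for lo, hi in zip(edges[:-1], edges[1:])]


def isotonic_forward(x: float, raw_weights: Sequence[float],
                     edges: Sequence[float], bias: float = 0.0) -> float:
    """Monotone piecewise linear output: bias + dot(non-negative slopes, activations)."""
    slopes = [softplus(w) for w in raw_weights]
    activations = segment_activations(x, edges)
    return bias + sum(s * a for s, a in zip(slopes, activations))


if __name__ == "__main__":
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical segment boundaries
    raw = [-2.0, 0.5, 3.0, -1.0]          # hypothetical unconstrained parameters
    for i in range(5):
        x = i / 4
        print(x, round(isotonic_forward(x, raw, edges), 4))
```

Because every activation is non-decreasing in `x` and every slope is positive, the output can never decrease as the input grows, whatever values the raw weights take.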
33. 【2603.06586】Scaling Multilingual Semantic Search in Uber Eats Delivery
Link: https://arxiv.org/abs/2603.06586
Authors: Bo Ling, Zheng Liu, Haoyang Chen, Divya Nagar, Luting Yang, Mehul Parsana
Categories: Information Retrieval (cs.IR)
Keywords: Uber Eats, Eats that unifies, production-oriented semantic retrieval, retail items, Matryoshka Representation Learning
Notes: 15 pages, 11 tables, 1 figure. Planned for submission to SIGIR or KDD 2026
Abstract:We present a production-oriented semantic retrieval system for Uber Eats that unifies retrieval across stores, dishes, and grocery/retail items. Our approach fine-tunes a Qwen2 two-tower base model using hundreds of millions of query-document interactions that were aggregated and anonymized pretraining. We train the model with a combination of InfoNCE on in-batch negatives and triplet-NCE loss on hard negatives, and we leverage Matryoshka Representation Learning (MRL) to serve multiple embedding sizes from a single model. Our system achieves substantial recall gains over a strong baseline across six markets and three verticals. This paper presents the end to end work including data curation, model architecture, large-scale training, and evaluation. We also share key insights and practical lessons for building a unified, multilingual, and multi-vertical retrieval system for consumer search.
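Editor's sketch: the abstract combines InfoNCE on in-batch negatives with Matryoshka Representation Learning (MRL). The snippet below shows the generic recipe in plain Python. It is a sketch under assumptions, not the Uber Eats training code: the temperature, the prefix sizes in `dims`, and the equal weighting of prefix losses are illustrative, and the triplet-NCE term on hard negatives is omitted.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries: Sequence[Sequence[float]], docs: Sequence[Sequence[float]],
             dim: int, temperature: float = 0.05) -> float:
    """In-batch InfoNCE on the first `dim` coordinates; docs[i] is the positive for queries[i]."""
    q = [normalize(v[:dim]) for v in queries]
    d = [normalize(v[:dim]) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def matryoshka_loss(queries, docs, dims=(2, 4)) -> float:
    """Average the InfoNCE loss over several embedding prefix sizes."""
    return sum(info_nce(queries, docs, dim) for dim in dims) / len(dims)


if __name__ == "__main__":
    queries = [[1.0, 0.0, 0.1, 0.0], [0.0, 1.0, 0.0, 0.1], [-1.0, 0.0, 0.0, 0.2]]
    aligned = [list(v) for v in queries]
    shuffled = aligned[1:] + aligned[:1]
    print(matryoshka_loss(queries, aligned) < matryoshka_loss(queries, shuffled))
```

Training every prefix against the same objective is what allows a single model to serve shorter embeddings by simple truncation.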
34. 【2603.06582】Agentic SPARQL: Evaluating SPARQL-MCP-powered Intelligent Agents on the Federated KGQA Benchmark
Link: https://arxiv.org/abs/2603.06582
Authors: Daniel Dobriy, Frederik Bauer, Amr Azzam, Debayan Banerjee, Axel Polleres
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Keywords: Model Context Protocol, LLMs' planning capabilities, solve complex tasks, Knowledge Graph Question, Graph Question Answering
Notes:
Abstract:Standard protocols such as the Model Context Protocol (MCP) that allow LLMs to connect to tools have recently boosted "agentic" AI applications, which, powered by LLMs' planning capabilities, promise to solve complex tasks with the access of external tools and data sources. In this context, publicly available SPARQL endpoints offer a natural connection to combine various data sources through MCP by (a) implementing a standardised protocol and query language, (b) standardised metadata formats, and (c) the native capability to federate queries. In the present paper, we explore the potential of SPARQL-MCP-based intelligent agents to facilitate federated SPARQL querying: firstly, we discuss how to extend an existing Knowledge Graph Question Answering benchmark towards agentic federated Knowledge Graph Question Answering (FKGQA); secondly, we implement and evaluate the ability of integrating SPARQL federation with LLM agents via MCP (incl. endpoint discovery/source selection, schema exploration, and query formulation), comparing different architectural options against the extended benchmark. Our work complements and extends prior work on automated SPARQL query federation towards fruitful combinations with agentic AI.
35. 【2603.08370】Unifying On- and Off-Policy Variance Reduction Methods
Link: https://arxiv.org/abs/2603.08370
Authors: Olivier Jeunen
Categories: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
Keywords: Continuous and efficient, efficient experimentation, experimentation is key, practical success, success of user-facing
Notes:
Abstract:Continuous and efficient experimentation is key to the practical success of user-facing applications on the web, both through online A/B-tests and off-policy evaluation. Despite their shared objective -- estimating the incremental value of a treatment -- these domains often operate in isolation, utilising distinct terminologies and statistical toolkits. This paper bridges that divide by establishing a formal equivalence between their canonical variance reduction methods. We prove that the standard online Difference-in-Means estimator is mathematically identical to an off-policy Inverse Propensity Scoring estimator equipped with an optimal (variance-minimising) additive control variate. Extending this unification, we demonstrate that widespread regression adjustment methods (such as CUPED, CUPAC, and ML-RATE) are structurally equivalent to Doubly Robust estimation. This unified view extends our understanding of commonly used approaches, and can guide practitioners and researchers working on either class of problems.
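Editor's sketch: the first equivalence in this abstract can be checked numerically on toy data. The snippet is a sketch under assumptions and may differ from the paper's exact construction. It assumes a two-arm experiment with a known treatment propensity and a baseline-corrected IPS estimator of the form `beta + mean(w * (r - beta))`. For weights `w = 1[a == arm] / p`, minimising the variance over `beta` gives the arm's conditional mean reward (derivation in the comments), and the code plugs in the arm's sample mean. All data values are invented.

```python
from statistics import fmean
from typing import Sequence


def ips_with_baseline(actions: Sequence[int], rewards: Sequence[float],
                      arm: int, propensity: float, beta: float) -> float:
    """Baseline-corrected IPS value of always playing `arm`:
    beta + (1/n) * sum_i 1[a_i == arm] * (r_i - beta) / propensity."""
    n = len(actions)
    correction = sum((r - beta) / propensity
                     for a, r in zip(actions, rewards) if a == arm)
    return beta + correction / n


def difference_in_means(actions: Sequence[int], rewards: Sequence[float]) -> float:
    """Standard A/B estimator: mean treated reward minus mean control reward."""
    treated = [r for a, r in zip(actions, rewards) if a == 1]
    control = [r for a, r in zip(actions, rewards) if a == 0]
    return fmean(treated) - fmean(control)


def ips_difference_arm_mean_baseline(actions, rewards, p_treat: float) -> float:
    """IPS treatment effect with each arm's baseline set to that arm's sample mean.
    With E[w] = 1, Var(w * (r - beta)) is minimised at
    beta* = (E[w^2 r] - E[w r]) / (E[w^2] - 1), which equals E[r | arm] here."""
    values = {}
    for arm, p in ((1, p_treat), (0, 1.0 - p_treat)):
        beta = fmean([r for a, r in zip(actions, rewards) if a == arm])
        values[arm] = ips_with_baseline(actions, rewards, arm, p, beta)
    return values[1] - values[0]


if __name__ == "__main__":
    actions = [1, 1, 1, 1, 1, 0, 0, 0]          # invented toy log, 5 treated, 3 control
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    print(difference_in_means(actions, rewards))
    print(ips_difference_arm_mean_baseline(actions, rewards, p_treat=0.5))
```

With the baseline set to the arm's sample mean, the IPS correction term sums to zero, so the estimate collapses to the arm mean and the difference matches Difference-in-Means. Plain IPS (`beta = 0`) generally does not, because the realised split rarely matches the nominal propensity.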
</p>
Computer Vision
1. 【2603.08709】Scale Space Diffusion
Link: https://arxiv.org/abs/2603.08709
Authors: Soumik Mukhopadhyay, Prateksha Udhayanan, Abhinav Shrivastava
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Scale Space Diffusion, Diffusion models degrade, models degrade images, Scale Space, Space Diffusion
Notes: Project website: [this https URL](https://prateksha.github.io/projects/scale-space-diffusion/) . The first two authors contributed equally
Abstract:Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our project website (https://prateksha.github.io/projects/scale-space-diffusion/) is publicly available.
2. 【2603.08708】FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
Link: https://arxiv.org/abs/2603.08708
Authors: Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enables pretrained Vision-Language, tuning enables pretrained, CLIP-based prompt tuning, prompt tuning enables, pretrained Vision-Language Models
Notes: 27 Pages, 9 Figures, 15 Tables
Abstract:CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Codes are available at: this https URL
3. 【2603.08703】HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Link: https://arxiv.org/abs/2603.08703
Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: theoretically infinite length, infinite length, offers a promising, generating videos, videos of theoretically
Notes: Project page: [this https URL](https://jacky-hate.github.io/HiAR/) Code: [this https URL](https://github.com/Jacky-hate/HiAR)
Abstract:Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet, this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8x wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
4. 【2603.08681】ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
Link: https://arxiv.org/abs/2603.08681
Authors: Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: jointly perform human, perform human localization, pose estimation, pose, estimation
Notes:
Abstract:Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training respectively compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.
5. 【2603.08674】Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Link: https://arxiv.org/abs/2603.08674
Authors: Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating complete, tackle the challenging, challenging task, task of generating, including relative position
Notes: Accepted to CVPR 2026
Abstract:We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
6. 【2603.08661】ImprovedGS+: A High-Performance C++/CUDA Re-Implementation Strategy for 3D Gaussian Splatting
Link: https://arxiv.org/abs/2603.08661
Authors: Jordi Muñoz Vicente
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advancements, balancing reconstruction fidelity, Gaussian Splatting, computational efficiency, shifted the focus
Notes: 6 pages, 1 figure. Technical Report. This work introduces ImprovedGS+, a library-free C++/CUDA implementation for 3D Gaussian Splatting within the LichtFeld-Studio framework. Source code available at [this https URL](https://github.com/jordizv/ImprovedGS-Plus)
Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have shifted the focus toward balancing reconstruction fidelity with computational efficiency. In this work, we propose ImprovedGS+, a high-performance, low-level reinvention of the ImprovedGS strategy, implemented natively within the LichtFeld-Studio framework. By transitioning from high-level Python logic to hardware-optimized C++/CUDA kernels, we achieve a significant reduction in host-device synchronization and training latency. Our implementation introduces a Long-Axis-Split (LAS) CUDA kernel, custom Laplacian-based importance kernels with Non-Maximum Suppression (NMS) for edge scores, and an adaptive Exponential Scale Scheduler. Experimental results on the Mip-NeRF360 dataset demonstrate that ImprovedGS+ establishes a new Pareto-optimal front for scene reconstruction. Our 1M-budget variant outperforms the state-of-the-art MCMC baseline by achieving a 26.8% reduction in training time (saving 17 minutes per session) and utilizing 13.3% fewer Gaussians while maintaining superior visual quality. Furthermore, our full variant demonstrates a 1.28 dB PSNR increase over the ADC baseline with a 38.4% reduction in parametric complexity. These results validate ImprovedGS+ as a scalable, high-speed solution that upholds the core pillars of Speed, Quality, and Usability within the LichtFeld-Studio ecosystem.
7. 【2603.08648】CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Link: https://arxiv.org/abs/2603.08648
Authors: Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: composing short clips, content creation shifts, video content creation, long-form narratives, composing short
Notes:
Abstract:As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
8. 【2603.08645】Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Link: https://arxiv.org/abs/2603.08645
Authors: Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: hand-designed blendshape spaces, achieve high visual, learning expression-dependent facial, avoiding parametric face, parametric face templates
Notes:
Abstract:Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
9. 【2603.08639】UNBOX: Unveiling Black-box visual models with Natural-language
Link: https://arxiv.org/abs/2603.08639
Authors: Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Ensuring trustworthiness, trustworthiness in open-world, recognition requires models, Ensuring, open-world visual recognition
Notes: Under review at IJCV
Abstract:Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.
10. 【2603.08620】StreamReady: Learning What to Answer and When in Long Streaming Videos
Link: https://arxiv.org/abs/2603.08620
Authors: Shehreen Azad,Vibhav Vineet,Yogesh Singh Rawat
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduces real-time utility, involves time-sensitive scenarios, passed reduces real-time, Streaming video understanding, evidence reflects speculation
Comments: Accepted in CVPR 2026
Click to view abstract
Abstract:Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
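The idea of a timing-aware score with asymmetric early and late penalties can be illustrated with a small sketch. The abstract does not give the ARS formula, so everything below is an assumption made for illustration: the exponential decay, the function names `readiness_score` and `effective_accuracy`, and the `early_rate` and `late_rate` parameters are hypothetical and are not the paper's definitions.

```python
import math


def readiness_score(t_answer, win_start, win_end, early_rate=1.0, late_rate=0.25):
    """Hypothetical timing score in (0, 1]; NOT the paper's ARS definition.

    Answers inside the evidence window [win_start, win_end] score 1.0.
    Answers outside it decay exponentially with the distance to the window,
    using separate rates so early and late answers are penalized differently.
    """
    if t_answer < win_start:
        return math.exp(-early_rate * (win_start - t_answer))
    if t_answer > win_end:
        return math.exp(-late_rate * (t_answer - win_end))
    return 1.0


def effective_accuracy(records, **rates):
    """Average of correctness times timing score.

    records: iterable of (is_correct, t_answer, win_start, win_end).
    """
    records = list(records)
    if not records:
        return 0.0
    total = sum(
        float(ok) * readiness_score(t, start, end, **rates)
        for ok, t, start, end in records
    )
    return total / len(records)
```

With the default rates chosen here, an answer one second before the window scores lower than an answer one second after it; swapping the two rates would reverse that preference.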
11. 【2603.08611】FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
Link: https://arxiv.org/abs/2603.08611
Authors: Anqi Joyce Yang,James Tu,Nikita Dvornik,Enxu Li,Raquel Urtasun
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: vulnerable road users, complex traffic environments, traffic control devices, navigate complex traffic, semantic classes pertaining
Comments: Published at 9th Annual Conference on Robot Learning (CoRL 2025)
Click to view abstract
Abstract:In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at this https URL.
12. 【2603.08605】Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation
Link: https://arxiv.org/abs/2603.08605
Authors: Hikmat Khan,Wei Chen,Muhammad Khalid Khan Niazi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: histopathological grading depends, Background and objectives, cancer histopathological grading, histopathological grading, grading depends
Comments:
Click to view abstract
Abstract:Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.
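Two ingredients named in the abstract, an Exponential Moving Average teacher and confidence-based fusion of teacher predictions with sparse annotations, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: parameters are plain floats in a dict, masks are flat lists, and the names `ema_update` and `fuse_pseudo_mask`, the `decay` value, and the `threshold` value are all hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def fuse_pseudo_mask(teacher_probs, sparse_labels, threshold=0.9):
    """Build a per-pixel training target (assumed fusion rule).

    - If a pathologist label exists (0 or 1), keep it.
    - Otherwise use the teacher's prediction only when it is confident.
    - Otherwise return None so the pixel can be ignored in the loss.
    """
    fused = []
    for prob, label in zip(teacher_probs, sparse_labels):
        if label is not None:
            fused.append(label)
        elif prob >= threshold:
            fused.append(1)
        elif prob <= 1.0 - threshold:
            fused.append(0)
        else:
            fused.append(None)
    return fused
```

A curriculum in this setting could, for example, lower `threshold` over training so that more unannotated pixels receive pseudo labels; whether the paper does this is not stated in the abstract.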
13. 【2603.08592】Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Link: https://arxiv.org/abs/2603.08592
Authors: Jiangye Yuan,Gowri Kumar,Baoyuan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Language Models, space remains limited, Multimodal Large
Comments:
Click to view abstract
Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
14. 【2603.08590】PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition
Link: https://arxiv.org/abs/2603.08590
Authors: Zeyu Ling,Qing Shuai,Teng Zhang,Shiyang Li,Bo Han,Changqing Zou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advanced rapidly, generation, challenges persist, latent, latent space
Comments:
Click to view abstract
Abstract:Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
15. 【2603.08589】CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Link: https://arxiv.org/abs/2603.08589
Authors: Yucheng Wang,Zedong Wang,Yuetong Wu,Yue Ma,Dan Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unified diffusion editors, local vs global, Unified diffusion, shared backbone, heterogeneous demands
Comments: Accepted by CVPR 2026. Project page: [this https URL](https://care-edit.github.io/)
Click to view abstract
Abstract:Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.
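Sparse top-K selection over a small set of experts, followed by a weighted fusion of their outputs, is a standard mixture-of-experts pattern and can be sketched generically. The code below is not CARE-Edit's router or its Latent Mixture module: the router logits are supplied directly rather than computed from conditions and timesteps, tokens are plain lists of floats, and the names `route_top_k` and `mix_expert_outputs` are hypothetical.

```python
import math

# Expert order assumed for illustration only.
EXPERTS = ("text", "mask", "reference", "base")


def route_top_k(logits, k=2):
    """Pick the k largest logits and softmax-normalize over the selected ones.

    Returns {expert_index: weight}; unselected experts get no entry, so they
    are never evaluated.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: value / norm for i, value in exps.items()}


def mix_expert_outputs(token, logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    weights = route_top_k(logits, k)
    mixed = [0.0] * len(token)
    for index, weight in weights.items():
        output = experts[index](token)
        mixed = [m + weight * o for m, o in zip(mixed, output)]
    return mixed
```

Because only the selected experts are called, the cost per token scales with `k` rather than with the number of experts, which is the usual motivation for sparse routing.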
16. 【2603.08583】DualFlexKAN: Dual-stage Kolmogorov-Arnold Networks with Independent Function Control
Link: https://arxiv.org/abs/2603.08583
Authors: Andrés Ortiz,Nicolás J. Gallego-Molina,Carmen Jiménez-Mesa,Juan M. Górriz,Javier Ramírez
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: static inductive bias, approximate complex topologies, complex topologies solely, Multi-Layer Perceptrons, rely on pre-defined
Comments: 22 pages, 12 figures
Click to view abstract
Abstract:Multi-Layer Perceptrons (MLPs) rely on pre-defined, fixed activation functions, imposing a static inductive bias that forces the network to approximate complex topologies solely through increased depth and width. Kolmogorov-Arnold Networks (KANs) address this limitation through edge-centric learnable functions, yet their formulation suffers from quadratic parameter scaling and architectural rigidity that hinders the effective integration of standard regularization techniques. This paper introduces the DualFlexKAN (DFKAN), a flexible architecture featuring a dual-stage mechanism that independently controls pre-linear input transformations and post-linear output activations. This decoupling enables hybrid networks that optimize the trade-off between expressiveness and computational cost. Unlike standard formulations, DFKAN supports diverse basis function families, including orthogonal polynomials, B-splines, and radial basis functions, integrated with configurable regularization strategies that stabilize training dynamics. Comprehensive evaluations across regression benchmarks, physics-informed tasks, and function approximation demonstrate that DFKAN outperforms both MLPs and conventional KANs in accuracy, convergence speed, and gradient fidelity. The proposed hybrid configurations achieve superior performance with one to two orders of magnitude fewer parameters than standard KANs, effectively mitigating the parameter explosion problem while preserving KAN-style expressiveness. DFKAN provides a principled, scalable framework for incorporating adaptive non-linearities, proving particularly advantageous for data-efficient learning and interpretable function discovery in scientific applications.
17. 【2603.08582】Online Sparse Synthetic Aperture Radar Imaging
Link: https://arxiv.org/abs/2603.08582
Authors: Conor Flynn,Radoslav Ivanov,Birsen Yazici
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fulfill mission objectives, modern defense applications, defense applications increasingly, applications increasingly relying, Synthetic Aperture Radar
Comments: IEEE Radar Conference 2026
Click to view abstract
Abstract:With modern defense applications increasingly relying on inexpensive, autonomous drones, lies the major challenge of designing computationally and memory-efficient onboard algorithms to fulfill mission objectives. This challenge is particularly significant in Synthetic Aperture Radar (SAR), where large volumes of data must be collected and processed for downstream tasks. We propose an online reconstruction method, the Online Fast Iterative Shrinkage-Thresholding Algorithm (Online FISTA), which incrementally reconstructs a scene with limited data through sparse coding. Rather than requiring storage of all received signal data, the algorithm recursively updates storage matrices for each iteration, greatly reducing memory demands. Online SAR image reconstruction facilitates more complex downstream tasks, such as Automatic Target Recognition (ATR), in an online manner, resulting in a more versatile and integrated framework compared to existing post-collection reconstruction and ATR approaches.
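One common way to avoid storing every measurement in a sparse least-squares problem is to keep only the running sums A^T A and A^T y, whose size depends on the scene dimension and not on how many samples have arrived. The sketch below applies standard FISTA to those sums. It is a real-valued toy under stated assumptions, not the paper's Online FISTA: the abstract does not say which storage matrices are used, SAR data are complex-valued, and the class name `OnlineLasso` and its methods are hypothetical.

```python
def soft_threshold(value, tau):
    """Proximal operator of tau * |value| (shrinks toward zero)."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


class OnlineLasso:
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 from streamed rows of A."""

    def __init__(self, n, lam=0.1):
        self.n = n
        self.lam = lam
        self.gram = [[0.0] * n for _ in range(n)]  # running A^T A
        self.corr = [0.0] * n                      # running A^T y
        self.x = [0.0] * n                         # current estimate

    def observe(self, row, y):
        """Fold one measurement (row of A, scalar y) into the running sums."""
        for i in range(self.n):
            self.corr[i] += row[i] * y
            for j in range(self.n):
                self.gram[i][j] += row[i] * row[j]

    def refine(self, iters=50):
        """Run FISTA iterations, warm-started from the previous estimate."""
        # The trace of a positive semidefinite matrix bounds its largest
        # eigenvalue, so it is a safe (conservative) Lipschitz constant.
        lip = sum(self.gram[i][i] for i in range(self.n))
        if lip == 0.0:
            return self.x
        x, z, t = list(self.x), list(self.x), 1.0
        for _ in range(iters):
            grad = [
                sum(self.gram[i][j] * z[j] for j in range(self.n)) - self.corr[i]
                for i in range(self.n)
            ]
            x_new = [
                soft_threshold(z[i] - grad[i] / lip, self.lam / lip)
                for i in range(self.n)
            ]
            t_new = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
            momentum = (t - 1.0) / t_new
            z = [x_new[i] + momentum * (x_new[i] - x[i]) for i in range(self.n)]
            x, t = x_new, t_new
        self.x = x
        return x
```

Calling `observe` for each new sample and `refine` whenever an updated image is needed keeps memory at O(n^2) regardless of the number of samples, at the cost of an O(n^2) update per sample.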
18. 【2603.08564】BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment
Link: https://arxiv.org/abs/2603.08564
Authors: Erdong Chen,Yuyang Ji,Jacob K. Greenberg,Benjamin Steel,Faraz Arkam,Abigail Lewis,Pranay Singh,Feng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video-based Clinical Gait, Clinical Gait Analysis, capturing pathological motion, overfit environmental biases, models overfit environmental
Comments:
Click to view abstract
Abstract:Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
19. 【2603.08551】mmGAT: Pose Estimation by Graph Attention with Mutual Features from mmWave Radar Point Cloud
Link: https://arxiv.org/abs/2603.08551
Authors: Abdullah Al Masud,Shi Xintong,Mondher Bouazizi,Ohtsuki Tomoaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: pivotal technologies spanning, Pose estimation, human action recognition, human pose estimation, Graph Neural Network
Comments: copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Click to view abstract
Abstract:Pose estimation and human action recognition (HAR) are pivotal technologies spanning various domains. While the image-based pose estimation and HAR are widely admired for their superior performance, they lack in privacy protection and suboptimal performance in low-light and dark environments. This paper exploits the capabilities of millimeter-wave (mmWave) radar technology for human pose estimation by processing radar data with Graph Neural Network (GNN) architecture, coupled with the attention mechanism. Our goal is to capture the finer details of the radar point cloud to improve the pose estimation performance. To this end, we present a unique feature extraction technique that exploits the full potential of the GNN processing method for pose estimation. Our model mmGAT demonstrates remarkable performance on two publicly available benchmark mmWave datasets and establishes new state of the art results in most scenarios in terms of human pose estimation. Our approach achieves a noteworthy reduction of pose estimation mean per joint position error (MPJPE) by 35.6% and PA-MPJPE by 14.1% from the current state of the art benchmark within this domain.
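For readers unfamiliar with the metric, mean per joint position error (MPJPE) is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below shows that computation and how a relative reduction such as the reported 35.6% is derived from two error values. The function names are hypothetical, and PA-MPJPE (which first aligns the prediction to the ground truth with a Procrustes fit) is not covered here.

```python
import math


def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints.

    predicted, target: equal-length sequences of (x, y, z) coordinates.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under this definition, a 35.6% reduction means the new error is 0.644 times the baseline error, in whatever unit the joint coordinates use.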
20. 【2603.08546】Interactive World Simulator for Robot Policy Training and Evaluation
Link: https://arxiv.org/abs/2603.08546
Authors: Yixuan Wang,Rhythm Syed,Fangyu Wu,Mengchao Zhang,Aykut Onol,Jose Barreiros,Hooshang Nayyeri,Tony Dear,Huan Zhang,Yunzhu Li
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Action-conditioned video prediction, Action-conditioned video, Interactive World Simulator, world models, shown strong potential
Comments: Project Page: [this https URL](https://yixuanwang.me/interactive_world_sim)
Click to view abstract
Abstract:Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
21. 【2603.08540】PCFEx: Point Cloud Feature Extraction for Graph Neural Networks
Link: https://arxiv.org/abs/2603.08540
Authors: Abdullah Al Masud,Shi Xintong,Mondher Bouazizi,Ohtsuki Tomoaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: gained significant attention, Graph neural networks, point cloud, neural networks, gained significant
Comments: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Click to view abstract
Abstract:Graph neural networks (GNNs) have gained significant attention for their effectiveness across various domains. This study focuses on applying GNN to process 3D point cloud data for human pose estimation (HPE) and human activity recognition (HAR). We propose novel point cloud feature extraction (PCFEx) techniques to capture meaningful information at the point, edge, and graph levels of the point cloud by considering point cloud as a graph. Moreover, we introduce a GNN architecture designed to efficiently process these features. Our approach is evaluated on four most popular publicly available millimeter wave radar datasets, three for HPE and one for HAR. The results show substantial improvements, with significantly reduced errors in all three HPE benchmarks, and an overall accuracy of 98.8% in mmWave-based HAR, outperforming the existing state of the art models. This work demonstrates the great potential of feature extraction incorporated with GNN modeling approach to enhance the precision of point cloud processing.
22. 【2603.08536】SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Link: https://arxiv.org/abs/2603.08536
Authors: Chao Wang,Zijin Yang,Yaofei Wang,Yuang Qi,Weiming Zhang,Nenghai Yu,Kejiang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advancements, multiple domains, widespread application, application across multiple, video
Comments:
Click to view abstract
Abstract:Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames(many) to Latent Frame(one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at this https URL.
23. 【2603.08533】SecAgent: Efficient Mobile GUI Agent with Semantic Context
Link: https://arxiv.org/abs/2603.08533
Authors: Yiping Xie,Song Chen,Jingxuan Xing,Wei Jiang,Zekun Zhu,Yingyao Wang,Pi Bu,Jun Song,Yuning Jiang,Bo Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Graphical User Interface, Mobile Graphical User, User Interface, Graphical User, complex smartphone tasks
Comments:
Click to view abstract
Abstract:Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.
24. 【2603.08523】BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images
Link: https://arxiv.org/abs/2603.08523
Authors: Sinan U. Ulu,A. Enes Doruk,I. Can Yagmur,Bahadir K. Gunturk,Oguz Hanoglu,Hasan F. Ates
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate building segmentation, single-view RGB satellite, RGB satellite imagery, remain ill-posed due, Accurate building
Comments:
Click to view abstract
Abstract:Accurate building segmentation and height estimation from single-view RGB satellite imagery are fundamental for urban analytics, yet remain ill-posed due to structural variability and the high computational cost of global context modeling. While current approaches typically adapt monocular depth architectures, they often suffer from boundary bleeding and systematic underestimation of high-rise structures. To address these limitations, we propose BuildMamba, a unified multi-task framework designed to exploit the linear-time global modeling of visual state-space models. Motivated by the need for stronger structural coupling and computational efficiency, we introduce three modules: a Mamba Attention Module for dynamic spatial recalibration, a Spatial-Aware Mamba-FPN for multi-scale feature aggregation via gated state-space scans, and a Mask-Aware Height Refinement module using semantic priors to suppress height artifacts. Extensive experiments demonstrate that BuildMamba establishes a new performance upper bound across three benchmarks. Specifically, it achieves an IoU of 0.93 and RMSE of 1.77 m on DFC23 benchmark, surpassing state-of-the-art by 0.82 m in height estimation. Simulation results confirm the model's superior robustness and scalability for large-scale 3D urban reconstruction.
25. 【2603.08521】OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras
Link: https://arxiv.org/abs/2603.08521
Authors: Yongzhi Lin,Kai Luo,Yuanfan Zheng,Hao Shi,Mengfei Duan,Yang Liu,Kailun Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: temporally consistent manner, Understanding dynamic, panoptic occupancy tracking, occupancy tracking, autonomous driving
Comments: The benchmark and source code will be made publicly available at [this https URL](https://github.com/YouthZest-Lin/OccTrack360)
Click to view abstract
Abstract:Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174 to 2234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Spherical Lift Module (SLM) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360.
26. 【2603.08514】Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection
Link: https://arxiv.org/abs/2603.08514
Authors: Shoumeng Qiu,Xinrun Li,Yang Long
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Recent DEtection TRansformer, achieved remarkable success, Recent DEtection, DEtection TRansformer, based frameworks
Comments:
Click to view abstract
Abstract:Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, the reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of discrete assignment, we utilize encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship further provides supervision signals for the learning of queries. Experimental results demonstrate that our proposed method bypasses the traditional matching process, significantly enhancing training efficiency, reducing the matching latency by over 50%, effectively eliminating the discrete matching bottleneck through differentiable correspondence learning, and also achieving superior performance compared to existing state-of-the-art methods.
27. 【2603.08503】Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction
Link: https://arxiv.org/abs/2603.08503
Authors: Zhe Yang, Guoqiang Zhao, Sheng Wu, Kai Luo, Kailun Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: Gaussian Opacity Fields, images are increasingly, vision due, omnidirectional Gaussian rendering, wide field
Comments: The source code and dataset will be released at [this https URL](https://github.com/1170632760/Spherical-GOF)
Abstract:Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at this https URL.
28. 【2603.08499】Improving Continual Learning for Gaussian Splatting based Environments Reconstruction on Commercial Off-the-Shelf Edge Devices
Link: https://arxiv.org/abs/2603.08499
Authors: Ivan Zaino, Matteo Risso, Daniele Jahier Pagliari, Miguel de Prado, Toon Van de Maele, Alessio Burrello
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Bayesian Gaussian Splatting, needed for SLAM, Variational Bayesian Gaussian, view synthesis, edge robotics
Comments:
Abstract:Novel view synthesis (NVS) is increasingly relevant for edge robotics, where compact and incrementally updatable 3D scene models are needed for SLAM, navigation, and inspection under tight memory and latency budgets. Variational Bayesian Gaussian Splatting (VBGS) enables replay-free continual updates for the 3DGS algorithm by maintaining a probabilistic scene model, but its high-precision computations and large intermediate tensors make on-device training impractical. We present a precision-adaptive optimization framework that enables VBGS training on resource-constrained hardware without altering its variational formulation. We (i) profile VBGS to identify memory/latency hotspots, (ii) fuse memory-dominant kernels to reduce materialized intermediate tensors, and (iii) automatically assign operation-level precisions via a mixed-precision search with bounded relative error. Across the Blender, Habitat, and Replica datasets, our optimised pipeline reduces peak memory from 9.44 GB to 1.11 GB and training time from ~234 min to ~61 min on an A5000 GPU, while preserving (and in some cases improving) reconstruction quality of the state-of-the-art VBGS baseline. We also enable for the first time NVS training on a commercial embedded platform, the Jetson Orin Nano, reducing per-frame latency by 19x compared to 3DGS.
29. 【2603.08498】All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
Link: https://arxiv.org/abs/2603.08498
Authors: Yi Yu, Libing Wu, Zhuangzhuang Zhang, Jing Qiu, Lijuan Huo, Jiaqi Feng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: feature-level sensory data, enables multiple vehicles, individual perception capacities, enables multiple, sensory data
Comments: Accepted by CVPR 2026
Abstract:Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, the first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis has proven the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.
30. 【2603.08497】Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models
Link: https://arxiv.org/abs/2603.08497
Authors: Heng Zhou, Ao Yu, Li Kang, Yuchen Fan, Yutao Fan, Xiufeng Song, Hejia Geng, Yiran Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: prove largely typography-blind, largely typography-blind, capable of recognizing, reading text, prove largely
Comments:
Abstract:Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
31. 【2603.08491】Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework
Link: https://arxiv.org/abs/2603.08491
Authors: Yutong Hu, Jinhui Chen, Chaoqiang Xu, Yuan Kou, Sili Zhou, Shaocheng Yan, Pengcheng Shi, Qingwu Hu, Jiayuan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: matches ground-level text, geo-tagged aerial imagery, ground-level text descriptions, matches ground-level, emergency response
Comments:
Abstract:Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at this https URL.
32. 【2603.08486】Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Link: https://arxiv.org/abs/2603.08486
Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal large language, enable harmful outputs, inputs enable harmful, Multimodal large, visual inputs enable
Comments:
Abstract:Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
33. 【2603.08483】X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Link: https://arxiv.org/abs/2603.08483
Authors: Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: highly realistic synthetic, contemporary generative systems, realistic synthetic videos, synthetic videos produced, challenging both humans
Comments:
Abstract:The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
34. 【2603.08445】Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation
Link: https://arxiv.org/abs/2603.08445
Authors: He-Yen Hsieh, Wei-Te Mark Ting, H.T. Kung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: subtle user-specific variations, degrade model performance, patterns commonly found, eyelid shape, commonly found
Comments: 21 pages, 16 figures, AAAI2026
Abstract:Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (e.g., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical in performing this domain adaptation: data and computation resources can be limited, especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods address adaptation costs by updating only a small set of weights, they may not be taking full advantage of structures encoded in pre-trained filters. To more effectively leverage existing structures learned during pre-training, we reframe personalization as a process to reweight existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa) to adapt gaze models by reweighting semantic patterns in pre-trained filters. With Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. Via an attention mechanism, we need only a few unlabeled samples to adjust and reweight pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze errors across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied to applications beyond vision, such as diffusion-based language models.
35. 【2603.08436】Can Vision-Language Models Solve the Shell Game?
Link: https://arxiv.org/abs/2603.08436
Authors: Tiedong Liu, Wee Sun Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Vision-Language Models, innate cognitive ability, innate cognitive, remains a critical, critical bottleneck
Comments:
Abstract:Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at this https URL.
36. 【2603.08434】Information Maximization for Long-Tailed Semi-Supervised Domain Generalization
Link: https://arxiv.org/abs/2603.08434
Authors: Leo Fillioux, Omprakash Chakraborty, Quentin Gopée, Pierre Marza, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Ismail Ben Ayed, Jose Dolz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semi-supervised domain generalization, tackle domain generalization, domain generalization, Semi-supervised domain, tackle domain
Comments:
Abstract:Semi-supervised domain generalization (SSDG) has recently emerged as an appealing alternative to tackle domain generalization when labeled data is scarce but unlabeled samples across domains are abundant. In this work, we identify an important limitation that hampers the deployment of state-of-the-art methods on more challenging but practical scenarios. In particular, state-of-the-art SSDG severely suffers in the presence of long-tailed class distributions, an arguably common situation in real-world settings. To alleviate this limitation, we propose IMaX, a simple yet effective objective based on the well-known InfoMax principle adapted to the SSDG scenario, where the Mutual Information (MI) between the learned features and latent labels is maximized, constrained by the supervision from the labeled samples. Our formulation integrates an $\alpha$-entropic objective, which mitigates the class-balance bias encoded in the standard marginal entropy term of the MI, thereby better handling arbitrary class distributions. IMaX can be seamlessly plugged into recent state-of-the-art SSDG, consistently enhancing their performance, as demonstrated empirically across two different image modalities.
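To make the "standard marginal entropy term of the MI" in the abstract above concrete, here is a minimal sketch of the usual InfoMax estimate computed from softmax predictions: marginal entropy of the average prediction minus the average per-sample entropy. This is the standard Shannon form only. The abstract does not specify the $\alpha$-entropic variant used by IMaX, so it is not reproduced here, and the function names and toy predictions are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mutual_information(probs):
    """Estimate I(X; Y) = H(mean prediction) - mean(H(prediction)).

    probs: list of per-sample class-probability vectors (e.g. softmax outputs).
    """
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    conditional = sum(entropy(p) for p in probs) / n
    return entropy(marginal) - conditional

# Toy predictions (hypothetical): confident and class-balanced vs. degenerate cases.
confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

mi_balanced = mutual_information(confident_balanced)
mi_uncertain = mutual_information(uncertain)
mi_collapsed = mutual_information(collapsed)
```

The marginal entropy term is maximized by a uniform class marginal, which is the class-balance bias the abstract refers to: under a long-tailed distribution, the true marginal is not uniform.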
37. 【2603.08426】Grow, Assess, Compress: Adaptive Backbone Scaling for Memory-Efficient Class Incremental Learning
Link: https://arxiv.org/abs/2603.08426
Authors: Adrian Garcia-Castañeda, Jon Irureta, Jon Imaz, Aizea Lojo
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Class Incremental Learning, Class Incremental, Incremental Learning, prevent catastrophic forgetting, poses a fundamental
Comments:
Abstract:Class Incremental Learning (CIL) poses a fundamental challenge: maintaining a balance between the plasticity required to learn new tasks and the stability needed to prevent catastrophic forgetting. While expansion-based methods effectively mitigate forgetting by adding task-specific parameters, they suffer from uncontrolled architectural growth and memory overhead. In this paper, we propose a novel dynamic scaling framework that adaptively manages model capacity through a cyclic "GRow, Assess, ComprEss" (GRACE) strategy. Crucially, we supplement backbone expansion with a novel saturation assessment phase that evaluates the utilization of the model's capacity. This assessment allows the framework to make informed decisions to either expand the architecture or compress the backbones into a streamlined representation, preventing parameter explosion. Experimental results demonstrate that our approach achieves state-of-the-art performance across multiple CIL benchmarks, while reducing memory footprint by up to 73% compared to purely expansionist models.
38. 【2603.08403】SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents
Link: https://arxiv.org/abs/2603.08403
Authors: Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reflective action world, action world modeling, enables controllable long-horizon, world modeling closed-loop, modeling closed-loop framework
Comments: 22 Pages, 11 Figures
Abstract:We introduce SPIRAL, a self-improving planning and iterative reflective action world modeling closed-loop framework that enables controllable long-horizon video generation conditioned on high-level semantic actions. Existing one-shot video generation models operate in open-loop, often resulting in incomplete action execution, weak semantic grounding, and temporal drift. SPIRAL formulates ActWM as a closed-loop think-act-reflect process, where generation proceeds step by step under explicit planning and feedback. A PlanAgent decomposes abstract actions into object-centric sub-actions, while a CriticAgent evaluates intermediate results and guides iterative refinement with long-horizon memory. This closed-loop design naturally supports RL evolving optimization, improving semantic alignment and temporal consistency over extended horizons. We further introduce the ActWM-Dataset and ActWM-Bench for training and evaluation. Experiments across multiple TI2V backbones demonstrate consistent gains on ActWM-Bench and mainstream video generation benchmarks, validating SPIRAL's effectiveness.
39. 【2603.08390】StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation
Link: https://arxiv.org/abs/2603.08390
Authors: Zhi Wang, Liu Liu, Ruonan Liu, Dan Guo, Meng Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent progress, hand grasp synthesis, manipulation remains significantly, grasp synthesis, significantly more challenging
Comments:
Abstract:Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
40. 【2603.08387】AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition
Link: https://arxiv.org/abs/2603.08387
Authors: Zhishu Liu, Kaishen Yuan, Bo Zhao, Hui Ma, Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Micro-expression Action Unit, detection identifies localized, facial muscle activations, Micro-expression Action, Action Unit
Comments:
Abstract:Micro-expression Action Unit (AU) detection identifies localized AUs from subtle facial muscle activations, providing a foundation for decoding affective cues. Previous methods face three key limitations: (1) heavy reliance on low-density visual information, rendering discriminative evidence vulnerable to background noise; (2) coarse-grained feature processing that misaligns with the demand for fine-grained representations; and (3) neglect of inter-AU correlations, restricting the parsing of complex expression patterns. We propose AULLM++, a reasoning-oriented framework leveraging Large Language Models (LLMs), which injects visual features into textual prompts as actionable semantic premises to guide inference. It formulates AU prediction into three stages: evidence construction, structure modeling, and deduction-based prediction. Specifically, a Multi-Granularity Evidence-Enhanced Fusion Projector (MGE-EFP) fuses mid-level texture cues with high-level semantics, distilling them into a compact Content Token (CT). Furthermore, inspired by micro- and macro-expression AU correspondence, we encode AU relationships as a sparse structural prior and learn interaction strengths via a Relation-Aware AU Graph Neural Network (R-AUGNN), producing an Instruction Token (IT). We then fuse CT and IT into a structured textual prompt and introduce Counterfactual Consistency Regularization (CCR) to construct counterfactual samples, enhancing the model's generalization. Extensive experiments demonstrate AULLM++ achieves state-of-the-art performance on standard benchmarks and exhibits superior cross-domain generalization.
41. 【2603.08386】Real-Time Drone Detection in Event Cameras via Per-Pixel Frequency Analysis
Link: https://arxiv.org/abs/2603.08386
Authors: Michael Bezick, Majid Sahin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Detecting fast-moving objects, unmanned aerial vehicle, Discrete Fourier Transforms, Discrete Fourier Transform, Detecting fast-moving
Comments:
Abstract:Detecting fast-moving objects, such as unmanned aerial vehicles (UAVs), from event camera data is challenging due to the sparse, asynchronous nature of the input. Traditional Discrete Fourier Transforms (DFT) are effective at identifying periodic signals, such as spinning rotors, but they assume uniformly sampled data, which event cameras do not provide. We propose a novel per-pixel temporal analysis framework using the Non-uniform Discrete Fourier Transform (NDFT), which we call Drone Detection via Harmonic Fingerprinting (DDHF). Our method uses purely analytical techniques that identify the frequency signature of drone rotors, as characterized by frequency combs in their power spectra, enabling a tunable and generalizable algorithm that achieves accurate real-time localization of UAVs. We compare against a YOLO detector under equivalent conditions, demonstrating improvement in accuracy and latency across a difficult array of drone speeds, distances, and scenarios. DDHF achieves an average localization F1 score of 90.89% and average latency of 2.39ms per frame, while YOLO achieves an F1 score of 66.74% and requires 12.40ms per frame. Through utilization of purely analytic techniques, DDHF is quickly tuned on small data, easily interpretable, and achieves accuracy and latency competitive with deep learning alternatives.
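As an illustration of the NDFT idea in the abstract above, the following sketch treats one pixel's events as unit impulses at their non-uniform timestamps and evaluates the power spectrum directly on a grid of candidate frequencies. The function names, the synthetic 120 Hz event train, and the frequency grid are assumptions made for this example; this is not the authors' DDHF implementation.

```python
import cmath
import math

def ndft_power(timestamps, freqs):
    """Power of sum_k exp(-2*pi*i*f*t_k) at each candidate frequency f (Hz).

    timestamps: event times in seconds; they need not be uniformly spaced.
    """
    n = len(timestamps)
    powers = []
    for f in freqs:
        acc = sum(cmath.exp(-2j * math.pi * f * t) for t in timestamps)
        powers.append(abs(acc) ** 2 / n)
    return powers

def dominant_frequency(timestamps, freqs):
    """Return the candidate frequency with the largest spectral power."""
    powers = ndft_power(timestamps, freqs)
    best = max(range(len(freqs)), key=lambda i: powers[i])
    return freqs[best]

# Hypothetical pixel: one event per blade pass at 120 Hz, with small timing jitter.
rotor_hz = 120.0
events = [k / rotor_hz + 1e-4 * math.sin(7.0 * k) for k in range(60)]

# Candidate grid (assumed): 20 Hz to 195 Hz in 5 Hz steps.
candidate_freqs = [float(f) for f in range(20, 200, 5)]
peak_hz = dominant_frequency(events, candidate_freqs)
```

A periodic source also produces peaks at integer multiples of its fundamental, which is the frequency-comb signature the abstract mentions. The grid here stops below the second harmonic, so only the fundamental is picked up.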
42. 【2603.08374】This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse
Link: https://arxiv.org/abs/2603.08374
Authors: Junhao Jia, Jiaqi Wang, Yunyou Liu, Haodong Jing, Yueyi Wu, Xian Wu, Yefeng Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: based explanation mechanism, intrinsic case based, case based explanation, Prototype networks provide, multiple prototypes degenerate
Comments:
Abstract:Prototype networks provide an intrinsic case-based explanation mechanism, but their interpretability is often undermined by prototype collapse, where multiple prototypes degenerate to highly redundant evidence. We attribute this failure mode to the terminal dynamics of Neural Collapse, where cross-entropy optimization suppresses intra-class variance and drives class-conditional features toward a low-dimensional limit. To mitigate this, we propose Adaptive Manifold Prototypes (AMP), a framework that leverages Riemannian optimization on the Stiefel manifold to represent class prototypes as orthonormal bases and make rank-one prototype collapse infeasible by construction. AMP further learns class-specific effective rank via a proximal gradient update on a nonnegative capacity vector, and introduces spatial regularizers that reduce rotational ambiguity and encourage localized, non-overlapping part evidence. Extensive experiments on fine-grained benchmarks demonstrate that AMP achieves state-of-the-art classification accuracy while significantly improving causal faithfulness over prior interpretable models.
43. 【2603.08364】Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation
Link: https://arxiv.org/abs/2603.08364
Authors: Zekun Li, Yinghuan Shi, Yang Gao, Dong Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion-based data augmentation, Diffusion-based data, improving classification performance, data augmentation, data scarcity
Comments:
Abstract:Diffusion-based data augmentation (DiffDA) has emerged as a promising approach to improving classification performance under data scarcity. However, existing works vary significantly in task configurations, model choices, and experimental pipelines, making it difficult to fairly compare methods or assess their effectiveness across different scenarios. Moreover, there remains a lack of systematic understanding of the full DiffDA workflow. In this work, we introduce UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. This perspective enables us to identify key differences among existing methods and clarify the overall design space. Building on this framework, we develop a comprehensive and fair evaluation protocol, benchmarking representative DiffDA methods across diverse low-data classification tasks. Extensive experiments reveal the relative strengths and limitations of different DiffDA strategies and offer practical insights into method design and deployment. All methods are re-implemented within a unified codebase, with full release of code and configurations to ensure reproducibility and to facilitate future research.
44. 【2603.08361】$Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation
Link: https://arxiv.org/abs/2603.08361
Authors: Yijie Zhu, Jie He, Rui Shao, Kaishen Yuan, Tao Tan, Xiaochen Yuan, Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advanced robotic manipulation, significantly advanced robotic, unifying perception, significantly advanced, manipulation by unifying
Comments:
Abstract:Recent vision-language-action (VLA) models have significantly advanced robotic manipulation by unifying perception, reasoning, and control. To achieve such integration, recent studies adopt a predictive paradigm that models future visual states or world knowledge to guide action generation. However, these models emphasize forecasting outcomes rather than reasoning about the underlying process of change, which is essential for determining how to act. To address this, we propose $\Delta$VLA, a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation, rather than regressing absolute future world states. Specifically, 1) to construct the current world knowledge prior, we propose the Prior-Guided World Knowledge Extractor (PWKE). It extracts manipulable regions, spatial relations, and semantic cues from the visual input, guided by auxiliary heads and prior pseudo labels, thus reducing redundancy. 2) Building upon this, to represent how world knowledge evolves under actions, we introduce the Latent World Variation Quantization (LWVQ). It learns a discrete latent space via a VQ-VAE objective to encode world knowledge variations, shifting prediction from full modalities to compact latents. 3) Moreover, to mitigate interference during variation modeling, we design the Conditional Variation Attention (CV-Atten), which promotes disentangled learning and preserves the independence of knowledge representations. Extensive experiments on both simulated benchmarks and real-world robotic tasks demonstrate $\Delta$VLA achieves state-of-the-art performance while improving efficiency. Code and real-world execution videos are available at this https URL.
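The abstract above describes encoding world-knowledge variations in a discrete latent space learned with a VQ-VAE objective. As a rough illustration of the quantization step only, the sketch below maps a variation vector (future state minus current state) to its nearest codebook entry, which is the generic VQ-VAE lookup. The feature vectors, the codebook, and the function name are hypothetical; the actual LWVQ module and its training losses are not reproduced.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector nearest in squared L2 distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 3-d "world knowledge" features for the current and future states.
current_state = [0.2, 0.5, 0.1]
future_state = [0.7, 0.5, 0.0]

# Model the change relative to the current state, not the absolute future state.
variation = [f - c for f, c in zip(future_state, current_state)]

# Hypothetical codebook of four variation prototypes (learned in a real VQ-VAE).
codebook = [
    [0.0, 0.0, 0.0],   # no change
    [0.5, 0.0, 0.0],   # increase along the first dimension
    [0.0, 0.5, 0.0],   # increase along the second dimension
    [-0.5, 0.0, 0.0],  # decrease along the first dimension
]
code_index, code_vector = quantize(variation, codebook)
```

Downstream components would then predict the discrete `code_index` instead of regressing a full future state, which is the shift to a compact latent that the abstract describes.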
45. 【2603.08347】Local-Global Prompt Learning via Sparse Optimal Transport
Link: https://arxiv.org/abs/2603.08347
Authors: Deniz Kizaroğlu,Ülku Tuncer Küçüktas,Emre Çakmakyurdu,Alptekin Temizel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: CLIP typically relies, global image embeddings, textual prompts matched, image embeddings, adaptation of vision-language
Comments: 9 pages, 3 figures, 4 tables. Code available at GitHub
Click to view abstract
Abstract:Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: this https URL
46. 【2603.08328】Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology
Link: https://arxiv.org/abs/2603.08328
Authors: Mina Jamshidi Idaji,Julius Hense,Tom Neuhäuser,Augustin Krause,Yanqing Luo,Oliver Eberle,Thomas Schnake,Laure Ciernik,Farnoush Rezaei Jafari,Reza Vahidimajd,Jonas Dippel,Christoph Walz,Frederick Klauschen,Andreas Mock,Klaus-Robert Müller
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multiple instance learning, enabled substantial progress, Multiple instance, instance learning, MIL
Comments:
Click to view abstract
Abstract:Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large number of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: this https URL
47. 【2603.08317】Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
Link: https://arxiv.org/abs/2603.08317
Authors: Sadegh Rahmaniboldaji,Filip Rybansky,Quoc C. Vuong,Anya C. Hurlbert,Frank Guerin,Andrew Gilbert
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: involving low resolution, Humans consistently outperform, challenging real-world conditions, real-world conditions involving, Identifiable Recognition Crops
Comments:
Click to view abstract
Abstract:Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.
48. 【2603.08316】SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Link: https://arxiv.org/abs/2603.08316
Authors: Junxian Li,Tu Lan,Haozhen Tan,Yan Meng,Haojin Zhu
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: based graphical user, graphical user interface, execute actions accurately, user interface, graphical user
Comments: 25 pages
Click to view abstract
Abstract:Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in this https URL.
49. 【2603.08313】HDR-NSFF: High Dynamic Range Neural Scene Flow Fields
Link: https://arxiv.org/abs/2603.08313
Authors: Shin Dong-Yeon,Kim Jun-Seong,Kwon Byung-Ki,Tae-Hyun Oh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: wider dynamic range, scenes typically spans, typically spans, standard cameras, HDR
Comments: ICLR 2026. Project page: [this https URL](https://shin-dong-yeon.github.io/HDR-NSFF/)
Click to view abstract
Abstract:Radiance of real-world scenes typically spans a much wider dynamic range than what standard cameras can capture. While conventional HDR methods merge alternating-exposure frames, these approaches are inherently constrained to 2D pixel-level alignment, often leading to ghosting artifacts and temporal inconsistency in dynamic scenes. To address these limitations, we present HDR-NSFF, a paradigm shift from 2D-based merging to 4D spatio-temporal modeling. Our framework reconstructs dynamic HDR radiance fields from alternating-exposure monocular videos by representing the scene as a continuous function of space and time, and is compatible with both neural radiance field and 4D Gaussian Splatting (4DGS) based dynamic representations. This unified end-to-end pipeline explicitly models HDR radiance, 3D scene flow, geometry, and tone-mapping, ensuring physical plausibility and global coherence. We further enhance robustness by (i) extending semantic-based optical flow with DINO features to achieve exposure-invariant motion estimation, and (ii) incorporating a generative prior as a regularizer to compensate for limited observation in monocular captures and saturation-induced information loss. To evaluate HDR space-time view synthesis, we present the first real-world HDR-GoPro dataset specifically designed for dynamic HDR scenes. Experiments demonstrate that HDR-NSFF recovers fine radiance details and coherent dynamics even under challenging exposure variations, thereby achieving state-of-the-art performance in novel space-time view synthesis. Project page: this https URL
50. 【2603.08309】Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Link: https://arxiv.org/abs/2603.08309
Authors: Yehonatan Elisha,Oren Barkan,Noam Koenigstein
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: semantically meaningful features, Vision Transformers, meaningful features, semantically meaningful, Transformers
Comments: CVPR 2026; Project page: [this https URL](https://yonisgit.github.io/concept-ft/)
Click to view abstract
Abstract:Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
51. 【2603.08305】Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation
Link: https://arxiv.org/abs/2603.08305
Authors: Daniele Molino,Camillo Maria Caruso,Paolo Soda,Valerio Guarrasi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: imaging provide semantic, provide semantic control, medical imaging provide, anatomically inconsistent, Text-conditioned generative models
Comments:
Click to view abstract
Abstract:Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.
52. 【2603.08289】Novel Semantic Prompting for Zero-Shot Action Recognition
Link: https://arxiv.org/abs/2603.08289
Authors: Salman Iqbal,Waheed Rehman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Zero-shot action recognition, Zero-shot action, relies on transferring, transferring knowledge, action recognition relies
Comments:
Click to view abstract
Abstract:Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.
53. 【2603.08279】OSCAR: Occupancy-based Shape Completion via Acoustic Neural Implicit Representations
Link: https://arxiv.org/abs/2603.08279
Authors: Magdalena Wysocki,Kadir Burak Buldu,Miruna-Alexandra Gafencu,Mohammad Farid Azampour,Nassir Navab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: invasive spine interventions, guiding minimally invasive, minimally invasive spine, remains challenging due, view-dependent signal variations
Comments:
Click to view abstract
Abstract:Accurate 3D reconstruction of vertebral anatomy from ultrasound is important for guiding minimally invasive spine interventions, but it remains challenging due to acoustic shadowing and view-dependent signal variations. We propose an occupancy-based shape completion method that reconstructs complete 3D anatomical geometry from partial ultrasound observations. Crucially for intra-operative applications, our approach extracts the anatomical surface directly from the image, avoiding the need for anatomical labels during inference. This label-free completion relies on a coupled latent space representing both the image appearance and the underlying anatomical shape. By leveraging a Neural Implicit Representation (NIR) that jointly models both spatial occupancy and acoustic interactions, the method uses acoustic parameters to become implicitly aware of the unseen regions without explicit shadowing labels through tracking acoustic signal transmission. We show that this method outperforms state-of-the-art shape completion for B-mode ultrasound by 80% in HD95 score. We validate our approach both in-silico and on phantom US images with registered mesh models from CT labels, demonstrating accurate reconstruction of occluded anatomy and robust generalization across diverse imaging conditions. Code and data will be released on publication.
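The HD95 score cited above is the 95th-percentile Hausdorff distance between two surfaces. The sketch below follows one common convention (the maximum of the two directed 95th percentiles, with linear interpolation) on small point sets. Other conventions exist, and the abstract does not say which one the authors use, so treat the details as assumptions.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def nearest_distances(source, target):
    """For each source point, the Euclidean distance to its closest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def hd95(points_a, points_b):
    """Symmetric 95th-percentile Hausdorff distance between two point sets."""
    d_ab = nearest_distances(points_a, points_b)
    d_ba = nearest_distances(points_b, points_a)
    return max(percentile(d_ab, 95), percentile(d_ba, 95))


# Toy 3-D surfaces: a predicted shape that is shifted by 0.5 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(x, y, z + 0.5) for x, y, z in reference]
score = hd95(reference, predicted)
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a few outlier points; lower is better.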
54. 【2603.08271】Prototype-Guided Concept Erasure in Diffusion Models
Link: https://arxiv.org/abs/2603.08271
Authors: Yuze Cai,Jiahao Lu,Hongxiang Shi,Yichao Zhou,Hong Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating undesired content, undesired content, extensively utilized, generating undesired, Elon Musk
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such as "sexual" or "violent", whose wide scope and multi-faceted nature make them difficult to erase reliably. To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of concept prototypes that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure. Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.
55. 【2603.08264】Event-based Motion Appearance Fusion for 6D Object Pose Tracking
Link: https://arxiv.org/abs/2603.08264
Authors: Zhichao Li,Chiara Bartolozzi,Lorenzo Natale,Arren Glover
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: industrial settings, essential task, fundamental and essential, home and industrial, pose
Comments:
Click to view abstract
Abstract:Object pose tracking is a fundamental and essential capability for robots performing tasks in home and industrial settings. The most commonly used sensors for this are RGB-D cameras, which run into limitations in highly dynamic environments due to motion blur and frame-rate constraints. Event cameras have remarkable features such as high temporal resolution and low latency, which make them potentially ideal vision sensors for object pose tracking at high speed. Even so, there are still only a few works on 6D pose tracking with event cameras. In this work, we take advantage of the high temporal resolution and propose a method that fuses a propagation step with a pose correction strategy. Specifically, we use 6D object velocity obtained from event-based optical flow for pose propagation, after which a template-based local pose correction module refines the pose. Our learning-free method performs comparably to state-of-the-art algorithms, and in some cases outperforms them for fast-moving objects. The results indicate the potential for using event cameras in highly dynamic scenarios where the use of deep network approaches is limited by low update rates.
56. 【2603.08258】WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Link: https://arxiv.org/abs/2603.08258
Authors: Lei Wang,Yang Cheng,Senmao Li,Ge Wu,Yaxing Wang,Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limits practical deployment, slow inference limits, inference limits practical, Stable Diffusion, practical deployment
Comments: Accepted to CVPR 2026; Code: [this https URL](https://github.com/gudaochangsheng/WaDi)
Click to view abstract
Abstract:Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the Low-rank Rotation of weight Direction (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in Weight Direction-aware Distillation (WaDi), a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.
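One plausible way to separate a weight change into a direction part and a norm part, as the analysis above describes, is to compare the angle between teacher and student weight vectors with their relative norm difference. The sketch below does this for flattened weight vectors; it is an assumed decomposition for illustration, and the toy weights are made up, since the abstract does not give the exact measurement.

```python
import math


def direction_and_norm_change(teacher, student):
    """Return (angle in degrees between the weights, relative change in L2 norm)."""
    dot = sum(t * s for t, s in zip(teacher, student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_t * norm_s)))
    angle = math.degrees(math.acos(cosine))
    return angle, abs(norm_s - norm_t) / norm_t


# Hypothetical flattened weights: the student is a pure rotation of the teacher.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
angle, norm_change = direction_and_norm_change(teacher, student)
```

Here the direction changes by 90 degrees while the norm is unchanged, the kind of pattern that would motivate an adapter acting on direction only.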
57. 【2603.08254】DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Link: https://arxiv.org/abs/2603.08254
Authors: Zhuolin He,Jing Li,Guanghao Li,Xiaolei Chen,Jiacheng Tang,Siyang Zhang,Zhounan Jin,Feipeng Cai,Bin Li,Jian Pu,Jia Cai,Xiangyang Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental challenge due, significant temporal variations, moving objects, Dynamic, remains a fundamental
Comments:
Click to view abstract
Abstract:Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
58. 【2603.08245】Topologically Stable Hough Transform
Link: https://arxiv.org/abs/2603.08245
Authors: Stefan Huber,Kristóf Huszár,Michael Kerber,Martin Uray
Categories: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: well-known Hough transform, Hough transform, classical Hough transform, point clouds, well-known Hough
Comments: Extended abstract will be presented at EuroCG'26; 11 pages, 7 figures
Click to view abstract
Abstract:We propose an alternative formulation of the well-known Hough transform to detect lines in point clouds. Replacing the discretized voting scheme of the classical Hough transform by a continuous score function, its persistent features in the sense of persistent homology give a set of candidate lines. We also devise and implement an algorithm to efficiently compute these candidate lines.
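For context, the classical discretized voting scheme that this paper replaces can be sketched as follows. This shows only the textbook baseline, not the proposed continuous score function or the persistent-homology step. The bin sizes `n_theta` and `rho_step` and the toy point set are assumptions for illustration.

```python
import math
from collections import Counter


def hough_votes(points, n_theta=180, rho_step=1.0):
    """Classical Hough voting: each point votes for every (theta, rho) bin it lies on."""
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            # Normal form of a line: rho = x*cos(theta) + y*sin(theta).
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))] += 1
    return votes


# Ten points on the horizontal line y = 2 (theta = 90 degrees, rho = 2).
points = [(float(x), 2.0) for x in range(10)]
votes = hough_votes(points)
peak = votes[(90, 2)]
```

In this toy example the true bin `(90, 2)` collects all ten votes, but so does the neighbouring angle bin `(89, 2)`, which hints at how the result of discretized voting depends on the chosen bin sizes.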
59. 【2603.08240】SiMO: Single-Modality-Operable Multimodal Collaborative Perception
Link: https://arxiv.org/abs/2603.08240
Authors: Jiageng Wen,Shengjie Zhao,Bing Li,Jiafeng Huang,Kenan Ye,Hao Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: integrates multi-agent perspectives, perception integrates multi-agent, Collaborative perception integrates, overcome occlusion issues, Collaborative perception
Comments: Accepted to ICLR 2026. This arXiv version includes an additional appendix (Appendix 15) containing further philosophical discussion not included in the official ICLR peer-reviewed version
Click to view abstract
Abstract:Collaborative perception integrates multi-agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure--especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition--generally overlooked by existing methods--ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. The implementation details can be found in this https URL.
60. 【2603.08235】Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema
Link: https://arxiv.org/abs/2603.08235
Authors: Pablo Jimenez-Lizcano,Sergio Romero-Tapiador,Ruben Tolosana,Aythami Morales,Guillermo González de Rivera,Ruben Vera-Rodriguez,Julian Fierrez
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: diabetic macular edema, macular edema, working-age adults, preventable blindness, blindness among working-age
Comments: 6 pages, 4 figures, 2 tables
Click to view abstract
Abstract:Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.
61. 【2603.08228】GarmentPainter: Efficient 3D Garment Texture Synthesis with Character-Guided Diffusion Model
Link: https://arxiv.org/abs/2603.08228
Authors: Jinbo Wu,Xiaobo Gao,Xing Liu,Chen Zhao,Jialun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: challenging problem due, Generating high-fidelity, globally consistent texture, consistent texture synthesis, requirement for detailed
Comments:
Click to view abstract
Abstract:Generating high-fidelity, 3D-consistent garment textures remains a challenging problem due to the inherent complexities of garment structures and the stringent requirement for detailed, globally consistent texture synthesis. Existing approaches either rely on 2D-based diffusion models, which inherently struggle with 3D consistency, require expensive multi-step optimization, or depend on strict spatial alignment between 2D reference images and 3D meshes, which limits their flexibility and scalability. In this work, we introduce GarmentPainter, a simple yet efficient framework for synthesizing high-quality, 3D-aware garment textures in UV space. Our method leverages a UV position map as the 3D structural guidance, ensuring texture consistency across the garment surface during texture generation. To enhance control and adaptability, we introduce a type selection module, enabling fine-grained texture generation for specific garment components based on a character reference image, without requiring alignment between the reference image and the 3D mesh. GarmentPainter efficiently integrates all guidance signals into the input of a diffusion model in a spatially aligned manner, without modifying the underlying UNet architecture. Extensive experiments demonstrate that GarmentPainter achieves state-of-the-art performance in terms of visual fidelity, 3D consistency, and computational efficiency, outperforming existing methods in both qualitative and quantitative evaluations.
62. 【2603.08227】SRNeRV: A Scale-wise Recursive Framework for Neural Video Representation
Link: https://arxiv.org/abs/2603.08227
Authors: Jia Wang,Jun Zhu,Xinfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Implicit Neural Representations, Implicit Neural, Neural Representations, video representation, representation and compression
Comments: Accepted by IEEE ISCAS 2026
Click to view abstract
Abstract:Implicit Neural Representations (INRs) have emerged as a promising paradigm for video representation and compression. However, existing multi-scale INR generators often suffer from significant parameter redundancy by stacking independent processing blocks for each scale. Inspired by the principle of scale self-similarity in the generation process, we propose SRNeRV, a novel scale-wise recursive framework that replaces this stacked design with a parameter-efficient shared architecture. The core of our approach is a hybrid sharing scheme derived from decoupling the processing block into a scale-specific spatial mixing module and a scale-invariant channel mixing module. We recursively apply the same shared channel mixing module, which contains the majority of the parameters, across all scales, significantly reducing the model size while preserving the crucial capacity to learn scale-specific spatial patterns. Extensive experiments demonstrate that SRNeRV achieves a significant rate-distortion performance boost, especially in INR-friendly scenarios, validating that our sharing scheme successfully amplifies the core strengths of the INR paradigm.
63. 【2603.08224】SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Link: https://arxiv.org/abs/2603.08224
Authors: Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: video-text retrieval, facto choice, CLIP, Speech Aware Video, Aware Video rEpresentation
Comments: Accepted to CVPR2026
Click to view abstract
Abstract:For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in light of the SumR metric.
64. 【2603.08210】Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
Link: https://arxiv.org/abs/2603.08210
Authors: Zexi Wu, Qinghe Wang, Jing Dai, Baolu Li, Yiming Zhang, Yue Ma, Xu Jia, Hongming Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Achieving semantic alignment, significant challenge, Achieving semantic, remains a significant, Achieving
Comments: 10 pages
Click to view abstract
Abstract:Achieving semantic alignment across diverse video generation conditions remains a significant challenge. Methods that rely on explicit structural guidance often enforce rigid spatial constraints that limit semantic flexibility, whereas models tailored for individual control types lack interoperability and adaptability. These design bottlenecks hinder progress toward flexible and efficient semantic video generation. To address this, we propose Video2LoRA, a scalable and generalizable framework for semantic-controlled video generation that conditions on a reference video. Video2LoRA employs a lightweight hypernetwork to predict personalized LoRA weights for each semantic input, which are combined with auxiliary matrices to form adaptive LoRA modules integrated into a frozen diffusion backbone. This design enables the model to generate videos consistent with the reference semantics while preserving key style and content variations, eliminating the need for any per-condition training. Notably, the final model weighs less than 150MB, making it highly efficient for storage and deployment. Video2LoRA achieves coherent, semantically aligned generation across diverse conditions and exhibits strong zero-shot generalization to unseen semantics.
65. 【2603.08208】Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors
Link: https://arxiv.org/abs/2603.08208
Authors: Ishrat Jahan, Molla E Majid, M Murugappan, Muhammad E. H. Chowdhury, N.B.Prakash, Saad Bin Abul Kashem, Balamurugan Balusamy, Amith Khandakar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Reliable unmanned aerial, unmanned aerial vehicle, autonomous airspace monitoring, Reliable unmanned, aerial vehicle
Comments:
Click to view abstract
Abstract:Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods, such as wavelet-, Laplacian-, and decision-level approaches, often fail to preserve spatial correspondence across modalities and suffer from annotation inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.
66. 【2603.08202】MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
Link: https://arxiv.org/abs/2603.08202
Authors: Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: multi-modal contrastive learning, Contrastive learning, temperature, learning, multi-modal
Comments: 18 pages, 11 figures. Accepted at WACV 2026
Click to view abstract
Abstract:Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
67. 【2603.08199】Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking
Link: https://arxiv.org/abs/2603.08199
Authors: Xian Wu, Yitao Wu, Xiaoyu Li, Zijia Li, Lijun Zhao, Lining Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: combines rich visual, rich visual semantics, accurate depth cues, multi-object tracking, tracking reliability
Comments:
Click to view abstract
Abstract:LiDAR-camera 3D multi-object tracking (MOT) combines rich visual semantics with accurate depth cues to improve trajectory consistency and tracking reliability. In practice, however, LiDAR and cameras operate at different sampling rates. To maintain temporal alignment, existing data pipelines usually synchronize heterogeneous sensor streams and annotate them at a reduced shared frequency, forcing most prior methods to perform spatial fusion only at synchronized timestamps through projection-based or learnable cross-sensor association. As a result, abundant asynchronous observations remain underexploited, despite their potential to support more frequent association and more robust trajectory estimation over short temporal intervals. To address this limitation, we propose Fusion-Poly, a spatial-temporal fusion framework for 3D MOT that integrates asynchronous LiDAR and camera data. Fusion-Poly associates trajectories with multi-modal observations at synchronized timestamps and with single-modal observations at asynchronous timestamps, enabling higher-frequency updates of motion and existence states. The framework contains three key components: a frequency-aware cascade matching module that adapts to synchronized and asynchronous frames according to available detection modalities; a frequency-aware trajectory estimation module that maintains trajectories through high-frequency motion prediction, differential updates, and confidence-calibrated lifecycle management; and a full-state observation alignment module that improves cross-modal consistency at synchronized timestamps by optimizing image-projection errors. On the nuScenes test set, Fusion-Poly achieves 76.5% AMOTA, establishing a new state of the art among tracking-by-detection 3D MOT methods. Extensive ablation studies further validate the effectiveness of each component. Code will be released.
68. 【2603.08180】ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection
Link: https://arxiv.org/abs/2603.08180
Authors: Michael Kösel, Marcel Schreiber, Michael Ulrich, Claudius Gläser, Klaus Dietmayer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: autonomous driving systems, safe autonomous driving, driving systems, plays a critical, critical role
Comments: Accepted for publication at the 2025 IEEE Intelligent Transportation Systems Conference (ITSC)
Click to view abstract
Abstract:LiDAR-based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at this https URL.
69. 【2603.08174】MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Link: https://arxiv.org/abs/2603.08174
Authors: Junyu Shen, Zhendong She, Chenghanyu Zhang, Yuchuang Sun, Luqing Luo, Dingwei Tan, Zonghao Guo, Bo Guo, Zehua Han, Wupeng Xie, Yaxin Mu, Peng Zhang, Peipei Li, Fengxiang Wang, Yangang Sun, Maosong Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Comments:
Click to view abstract
Abstract:The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in EM domain requires overcoming three main challenges: (1) Data. The scarcity of high-quality datasets with paired EM signals and descriptive text annotations used for MLLMs pre-training; (2) Benchmark. The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) Model. A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN is state-of-the-art in the EM-Bench and exhibits remarkable robustness in low-SNR settings.
70. 【2603.08150】Edged USLAM: Edge-Aware Event-Based SLAM with Learning-Based Depth Priors
Link: https://arxiv.org/abs/2603.08150
Authors: Şebnem Sarıözkan, Hürkan Şahin, Olaya Álvarez-Tuñón, Erdal Kayacan
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Conventional visual simultaneous, abrupt lighting transitions, lighting transitions due, limited dynamic range, high dynamic range
Comments: 8 pages, 7 figures, 3 tables. Accepted to ICRA 2026. Project code and datasets available at [this https URL](https://github.com/sebnem-byte/Edged-USLAM)
Click to view abstract
Abstract:Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors, e.g., inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual-inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The front-end enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned aerial vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point-line event-based visual-inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.
71. 【2603.08147】MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Link: https://arxiv.org/abs/2603.08147
Authors: Hunor Laczkó, Libang Jia, Loc-Phat Truong, Diego Hernández, Sergio Escalera, Jordi Gonzalez, Meysam Madadi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: human datasets fall, datasets fall short, fashion-specific research, lacking either realistic, fall short
Comments:
Click to view abstract
Abstract:Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g. rolled sleeves, tucked shirt). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis. The dataset is available at this https URL .
72. 【2603.08135】VesselFusion: Diffusion Models for Vessel Centerline Extraction from 3D CT Images
Link: https://arxiv.org/abs/2603.08135
Authors: Soichi Mita, Shumpei Takezaki, Ryoma Bise
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduces annotation effort, natural vessel structures, important task, reduces annotation, annotation effort
Comments:
Click to view abstract
Abstract:Vessel centerline extraction from 3D CT images is an important task because it reduces the annotation effort needed to build a model that estimates a vessel structure. Estimating natural vessel structures is challenging because conventional approaches are deterministic models, which cannot capture a complex human structure. In this study, we propose VesselFusion, a diffusion model that extracts the vessel centerline from 3D CT images. The proposed method uses a coarse-to-fine representation of the centerline and a voting-based aggregation for a natural and stable extraction. VesselFusion was evaluated on a publicly available CT image dataset and achieved higher extraction accuracy and a more natural result than conventional approaches.
73. 【2603.08133】Fast Low-light Enhancement and Deblurring for 3D Dark Scenes
Link: https://arxiv.org/abs/2603.08133
Authors: Feng Zhang, Jinglong Wang, Ze Li, Yanghong Zhou, Yang Chen, Lei Chen, Xiatian Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: motion-blurred imagery remains, challenging task, view synthesis, motion-blurred imagery, imagery remains
Comments: 5 pages, 2 figures, Accepted at ICASSP 2026
Click to view abstract
Abstract:Novel view synthesis from low-light, noisy, and motion-blurred imagery remains a valuable and challenging task. Current volumetric rendering methods struggle with compound degradation, and sequential 2D preprocessing introduces artifacts due to interdependencies. In this work, we introduce FLED-GS, a fast low-light enhancement and deblurring framework that reformulates 3D scene restoration as an alternating cycle of enhancement and reconstruction. Specifically, FLED-GS inserts several intermediate brightness anchors to enable progressive recovery, preventing noise blow-up from harming deblurring or geometry. Each iteration sharpens inputs with an off-the-shelf 2D deblurrer and then performs noise-aware 3DGS reconstruction that estimates and suppresses noise while producing clean priors for the next level. Experiments show FLED-GS outperforms state-of-the-art LuSh-NeRF, achieving 21$\times$ faster training and 11$\times$ faster rendering.
74. 【2603.08131】UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
Link: https://arxiv.org/abs/2603.08131
Authors: Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural language descriptions, Understanding and localizing, augmented reality, language descriptions, implications for robotics
Comments: 14 pages, 6 figures, 3 tables
Click to view abstract
Abstract:Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing training-free reasoning generalizes robustly beyond curated benchmarks.
75. 【2603.08126】Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Link: https://arxiv.org/abs/2603.08126
Authors: Shentong Mo, Yibing Song
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: inputs typically requires, video inputs typically, strict audio-visual, audio, inputs typically
Comments:
Click to view abstract
Abstract:Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audios that are both semantically and rhythmically coherent to various video sequences.
76. 【2603.08113】SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving
Link: https://arxiv.org/abs/2603.08113
Authors: Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: VLA models results, Large Language Models, shown promising capabilities, empirical analysis reveals, directly applying existing
Comments:
Click to view abstract
Abstract:Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level this http URL address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world-knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer this http URL code will be released soon.
77. 【2603.08100】Adaptive MLP Pruning for Large Vision Transformers
Link: https://arxiv.org/abs/2603.08100
Authors: Chengchao Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: present impressive scalability, transformers present impressive, increased model capacity, Large vision transformers, vision transformers present
Comments:
Click to view abstract
Abstract:Large vision transformers present impressive scalability, as their performance can be well improved with increased model capacity. Nevertheless, their cumbersome parameters result in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt a Taylor-based method to evaluate the neuron importance of MLP modules. However, the importance computation using one-hot cross entropy loss ignores the potential predictions on other categories, thus degrading the quality of the evaluated importance scores. To address this issue, we introduce a label-free information entropy criterion to fully model the predictions of the original model for more accurate importance evaluation. Second, we rank the hidden neurons of each MLP by the above importance scores and apply a binary search algorithm to adaptively prune the ranked neurons according to the redundancy of different MLP modules, thereby avoiding a predefined compression ratio. Experimental results on several state-of-the-art large vision transformers, including CLIP and DINOv2, demonstrate that our method achieves roughly 40% parameter and FLOPs reduction in a near lossless manner. Moreover, when the models are not finetuned after pruning, our method outperforms other pruning methods by a significantly large margin. The source code and trained weights are available at this https URL.
78. 【2603.08096】TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
Link: https://arxiv.org/abs/2603.08096
Authors: Bryce Grant, Aryeh Rothenberg, Atri Banerjee, Peng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing methods face, Localizing objects, space is essential, objects and parts, parts from natural
Comments:
Click to view abstract
Abstract:Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at 1008x1008 resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at this https URL.
79. 【2603.08090】DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
Link: https://arxiv.org/abs/2603.08090
Authors: Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, Chao Deng, Peng Shu, Huan Yu, Jie Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: images depicting target, depicting target subjects, user instructions, aims to synthesize, depicting target
Comments:
Click to view abstract
Abstract:Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.
80. 【2603.08086】From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation
链接:https://arxiv.org/abs/2603.08086
作者:Yudai Noda,Kanji Tanaka
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Object-Goal Navigation, Large Language Model, recent Large Language, unknown environments, find and navigate
备注: 6 pages, 5 figures, technical report
点击查看摘要
Abstract:Object-Goal Navigation (ObjectNav) requires an agent to find and navigate to a target object category in unknown environments. While recent Large Language Model (LLM)-based agents exhibit zero-shot reasoning, they often rely on a "reactive" paradigm that lacks explicit spatial memory, leading to redundant exploration and myopic behaviors. To address these limitations, we propose a transition from reactive AI to "Map-Based AI" by integrating LLM-based semantic inference with a hybrid topological-grid mapping system. Our framework employs a fine-tuned Llama-2 model via Low-Rank Adaptation (LoRA) to infer semantic zone categories and target existence probabilities from verbalized object observations. In this study, a "zone" is defined as a functional area described by the set of observed objects, providing crucial semantic co-occurrence cues for finding the target. This semantic information is integrated into a topological graph, enabling the agent to prioritize high-probability areas and perform systematic exploration via Traveling Salesman Problem (TSP) optimization. Evaluations in the AI2-THOR simulator demonstrate that our approach significantly outperforms traditional frontier exploration and reactive LLM baselines, achieving a superior Success Rate (SR) and Success weighted by Path Length (SPL).
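The TSP-based exploration step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how zone probabilities enter the optimization, so the code below covers only the routing part: it brute-forces the shortest open tour from a start position through a handful of zone nodes. The function name `shortest_visit_order`, the 2D coordinates, and the zone names are all hypothetical and are not taken from the paper.

```python
import math
from itertools import permutations

def shortest_visit_order(start, zones):
    """Brute-force the shortest open tour from `start` through every zone.

    start: (x, y) position of the agent (hypothetical 2D layout).
    zones: {name: (x, y)} zone node positions.
    Only practical for a handful of zones, since it tries every ordering.
    """
    best_order, best_len = (), math.inf
    for order in permutations(zones):
        length, pos = 0.0, start
        for name in order:
            length += math.dist(pos, zones[name])
            pos = zones[name]
        if length < best_len:
            best_order, best_len = order, length
    return list(best_order), best_len

# Toy layout: three zones on a line, agent at the origin.
zones = {"kitchen": (4.0, 0.0), "bedroom": (1.0, 0.0), "bathroom": (2.0, 0.0)}
order, length = shortest_visit_order((0.0, 0.0), zones)
print(order, length)
```

A real system with many nodes would likely swap the exhaustive search for a heuristic or a dedicated solver; the brute-force version is used here only because it is easy to verify by hand.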
81. 【2603.08075】ALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
链接:https://arxiv.org/abs/2603.08075
作者:Yanan Wu,Yuhan Yan,Tailai Chen,Zhixiang Chi,ZiZhang Wu,Yi Jin,Yang Wang,Zhenbo Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unlabeled online stream, aims to recognize, online stream, unlabeled online, simultaneously discovering
备注: 14 pages, 6 figures, accepted by CVPR 2026
点击查看摘要
Abstract:On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion. The code is publicly available at this https URL.
82. 【2603.08069】Synthetic Defect Image Generation for Power Line Insulator Inspection Using Multimodal Large Language Models
链接:https://arxiv.org/abs/2603.08069
作者:Xuesong Wang,Caisheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Utility companies increasingly, companies increasingly rely, classifiers remains difficult, Utility companies, accurate defect-type classifiers
备注: Submitted to Engineering Applications of Artificial Intelligence, Feb. 16, 2026
点击查看摘要
Abstract:Utility companies increasingly rely on drone imagery for post-event and routine inspection, but training accurate defect-type classifiers remains difficult because defect examples are rare and inspection datasets are often limited or proprietary. We address this data-scarcity setting by using an off-the-shelf multimodal large language model (MLLM) as a training-free image generator to synthesize defect images from visual references and text prompts. Our pipeline increases diversity via dual-reference conditioning, improves label fidelity with lightweight human verification and prompt refinement, and filters the resulting synthetic pool using an embedding-based selection rule based on distances to class centroids computed from the real training split. We evaluate on ceramic insulator defect-type classification (shell vs. glaze) using a public dataset with a realistic low training-data regime (104 real training images; 152 validation; 308 test). Augmenting the 10% real training set with embedding-selected synthetic images improves test F1 score (harmonic mean of precision and recall) from 0.615 to 0.739 (20% relative), corresponding to an estimated 4-5x data-efficiency gain, and the gains persist with stronger backbone models and frozen-feature linear-probe baselines. These results suggest a practical, low-barrier path for improving defect recognition when collecting additional real defects is slow or infeasible.
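The embedding-based selection rule described in the abstract (distances to class centroids computed from the real training split) might look roughly like the sketch below. The abstract does not state the exact rule, so the per-class "keep the closest fraction" criterion, the `keep_ratio` parameter, the function names, and the toy 2D embeddings are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def select_synthetic(real_by_class, synthetic, keep_ratio=0.5):
    """Keep, per class, the synthetic embeddings closest to the real centroid.

    real_by_class: {label: [embedding, ...]} from the real training split.
    synthetic: list of (label, embedding) pairs for generated images.
    keep_ratio: hypothetical knob, the fraction of each class pool to keep.
    """
    centroids = {label: centroid(vecs) for label, vecs in real_by_class.items()}
    selected = []
    for label in centroids:
        pool = [emb for lab, emb in synthetic if lab == label]
        # Sort by Euclidean distance to the real-data centroid of this class.
        pool.sort(key=lambda emb: math.dist(emb, centroids[label]))
        n_keep = math.ceil(len(pool) * keep_ratio)
        selected.extend((label, emb) for emb in pool[:n_keep])
    return selected

# Toy 2D embeddings for the two defect types named in the abstract.
real = {"shell": [[0.0, 0.0], [0.2, 0.0]], "glaze": [[1.0, 1.0], [1.0, 0.8]]}
synth = [("shell", [0.1, 0.1]), ("shell", [0.9, 0.9]),
         ("glaze", [1.0, 1.0]), ("glaze", [0.0, 0.2])]
kept = select_synthetic(real, synth, keep_ratio=0.5)
print(kept)
```

In this toy run the synthetic samples that land near the wrong class centroid are dropped, which is the intuition behind filtering for label fidelity.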
83. 【2603.08064】Evaluating Generative Models via One-Dimensional Code Distributions
链接:https://arxiv.org/abs/2603.08064
作者:Zexi Jia,Pengcheng Luo,Yijia Zhong,Jinchao Zhang,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:continuous recognition features, discard cues critical, Codebook Histogram Distance, generative models rely, Mixture Model Score
备注:
点击查看摘要
Abstract:Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
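To make the idea of a distribution metric in token space concrete, here is a minimal sketch that compares codebook-usage histograms of two token sets. The abstract does not define CHD, so the choice of total variation distance, the function names, and the toy token sequences are assumptions; this should be read as an illustration of the general idea, not as the paper's metric.

```python
from collections import Counter

def code_histogram(token_sequences, codebook_size):
    """Normalized frequency of each codebook index over all sequences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    return [counts.get(k, 0) / total for k in range(codebook_size)]

def histogram_distance(hist_a, hist_b):
    """Total variation distance (half the L1 gap), which lies in [0, 1].

    Assumed stand-in for the distance used by CHD, which the abstract
    does not specify.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(hist_a, hist_b))

# Toy token sequences from a hypothetical 1D tokenizer with 4 codes.
real_tokens = [[0, 1, 1, 2], [0, 1, 2, 2]]
fake_tokens = [[0, 0, 0, 1], [0, 0, 3, 3]]
d = histogram_distance(code_histogram(real_tokens, 4),
                       code_histogram(fake_tokens, 4))
print(d)
```

Because the histograms are built by counting, no training is needed, which matches the "training-free" property claimed for CHD.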
84. 【2603.08063】Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling
链接:https://arxiv.org/abs/2603.08063
作者:Bowen Liu,Pengyue Jia,Wanyu Wang,Derong Xu,Jiawei Cheng,Jiancheng Dong,Xiao Han,Zimo Zhao,Chao Zhang,Bowen Yu,Fangyu Hong,Xiangyu Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exact spatial coordinates, cross-view UAV geolocalization, geo-referenced satellite databases, primary objective, objective of cross-view
备注:
点击查看摘要
Abstract:The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.
85. 【2603.08059】ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning
链接:https://arxiv.org/abs/2603.08059
作者:Yiran Zhao,Yaoqi Ye,Xiang Liu,Michael Qizhe Shieh,Trung Bui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:garnered significant attention, significant attention due, commercial multi-modal models, daily life, rapid advancement
备注:
点击查看摘要
Abstract:With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities (such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content), while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.
86. 【2603.08057】See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming
链接:https://arxiv.org/abs/2603.08057
作者:Petr Vanc,Jan Kristof Behrens,Václav Hlaváč,Karla Stepanova
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:real-world variability remains, Programming robots, intuitive concept, real-world variability, variability remains
备注: 8 pages, 11 figures
点击查看摘要
Abstract:Programming robots by demonstration (PbD) is an intuitive concept, but scaling it to real-world variability remains a challenge for most current teaching frameworks. Conditional task graphs are very expressive and can be defined incrementally, which fits very well with the PbD idea. However, acting using conditional task graphs requires reliable perception-grounded online branch selection. In this paper, we present See and Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay. Unlike prior approaches that rely on manual branching or low-dimensional signals (e.g., proprioception), our vision-based Switcher uses eye-in-hand images (high-dimensional) to select among competing successor skill parts and to detect out-of-distribution contexts that require new demonstrations. We integrate kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer and demonstrate that our proposed method is teaching modality-independent, enabling efficient in-situ recovery demonstrations. The system is validated in experiments on three challenging dexterous manipulation tasks. We evaluate our method under diverse conditions and furthermore conduct user studies with 8 participants. We show that the proposed method reliably performs branch selection and anomaly detection for novice users, achieving 90.7% and 87.9% accuracy, respectively, across 576 real-robot rollouts. We provide all code and data required to reproduce our experiments at this http URL.
87. 【2603.08055】Speed3R: Sparse Feed-forward 3D Reconstruction Models
链接:https://arxiv.org/abs/2603.08055
作者:Weining Ren,Xiao Tan,Kai Han
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:jointly inferring dense, inferring dense geometry, dense attention imposes, limits inference speed, severely limits inference
备注: CVPR 2026 Findings, project page: [this https URL](https://visual-ai.github.io/speed3r/)
点击查看摘要
Abstract:While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000-view sequences, while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $\pi^3$ backbones, our method delivers high-quality reconstructions at a fraction of computational cost, paving the way for efficient large-scale scene modeling.
88. 【2603.08034】Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout
链接:https://arxiv.org/abs/2603.08034
作者:Jun Yu,Naixiang Zheng,Guoyuan Wang,Yunxiang Zhang,Lingsi Zhu,Jiaen Liang,Wei Huang,Shengping Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:severe class imbalance, Affective Behavior Analysis, Emotion recognition, partial occlusions, class imbalance
备注:
点击查看摘要
Abstract:Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
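The sliding-window soft voting mentioned in the abstract can be sketched as follows: per-frame class probabilities are averaged over a window and the label is taken from the averaged distribution, which suppresses isolated frame-level flips. The centered window, the `window` length, and the function name `soft_vote` are assumptions, since the abstract gives no implementation details.

```python
def soft_vote(frame_probs, window=5):
    """Average class probabilities over a centered window, then take argmax.

    frame_probs: list of per-frame probability lists (one value per class).
    window: hypothetical odd window length in frames; the window is
            truncated at the start and end of the sequence.
    """
    half = window // 2
    n_classes = len(frame_probs[0])
    labels = []
    for t in range(len(frame_probs)):
        lo, hi = max(0, t - half), min(len(frame_probs), t + half + 1)
        chunk = frame_probs[lo:hi]
        mean = [sum(p[c] for p in chunk) / len(chunk) for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels

# Toy two-class sequence with a single-frame flip at index 2.
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.7, 0.3]]
raw = [max(range(2), key=p.__getitem__) for p in probs]
smoothed = soft_vote(probs, window=3)
print(raw, smoothed)
```

In the toy sequence the per-frame argmax flips to class 1 for one frame, while the windowed vote keeps class 0 throughout, which is the jitter reduction the abstract refers to.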
89. 【2603.08030】QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration
链接:https://arxiv.org/abs/2603.08030
作者:Fengyang Xiao,Jingjia Feng,Peng Hu,Dingming Zhang,Lei Xu,Guanyi Qin,Lu Li,Chunming He,Sina Farsiu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly challenging task, challenging task due, Real-world image restoration, clean ground-truth images, Real-world image
备注: 15 pages, 8 figures
点击查看摘要
Abstract:Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary non-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of the existing pseudo-labeling framework, establishing a new paradigm for learning from imperfect supervision. Code will be released.
90. 【2603.08028】Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
链接:https://arxiv.org/abs/2603.08028
作者:Ashkan Taghipour,Morteza Ghahremani,Zinuo Li,Hamid Laga,Farid Boussaid,Mohammed Bennamoun
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:martial arts remains, arts remains challenging, Generating videos, martial arts, arts remains
备注:
点击查看摘要
Abstract:Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.
91. 【2603.08023】Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
链接:https://arxiv.org/abs/2603.08023
作者:Sangjune Park,Inhyeok Choi,Donghyeon Soon,Youngwoo Jeon,Kyungdon Joo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Sound (cs.SD)
关键词:human motion characterized, virtual reality, expression and communication, content creation, form of human
备注: Accepted by WACV 2026
点击查看摘要
Abstract:Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets across sequence lengths show that our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances, compared to previous methods. Additional qualitative results and demo videos are available at this https URL.
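One plausible reading of a Gaussian-based beat representation is a per-frame signal that peaks at each beat and decays smoothly around it, instead of a binary on/off beat flag. The sketch below follows that reading; the max-over-beats formulation, the `sigma` width, and the function name are assumptions, since the abstract does not give the formula.

```python
import math

def gaussian_beat_signal(beat_frames, num_frames, sigma=2.0):
    """Per-frame beat strength in [0, 1].

    For frame t the value is the maximum over beats b of
    exp(-(t - b)^2 / (2 * sigma^2)), so it equals 1.0 exactly on a beat.
    sigma is a hypothetical width in frames.
    """
    signal = []
    for t in range(num_frames):
        strength = max(
            (math.exp(-((t - b) ** 2) / (2 * sigma ** 2)) for b in beat_frames),
            default=0.0,  # no beats at all: the signal stays at zero
        )
        signal.append(strength)
    return signal

# Toy example: beats at frames 5 and 15 in a 20-frame clip.
signal = gaussian_beat_signal([5, 15], num_frames=20, sigma=2.0)
print([round(s, 3) for s in signal])
```

Compared with a one-hot beat vector, the smooth signal also tells the model how close each frame is to the nearest beat.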
92. 【2603.08021】AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
链接:https://arxiv.org/abs/2603.08021
作者:Xiaofei Wu,Yi Zhang,Yumeng Liu,Yuexin Ma,Yujiao Shi,Xuming He
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Generating human grasping, Generating human, natural hand-object interactions, accurately reflect, essential for natural
备注:
点击查看摘要
Abstract:Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.
93. 【2603.08020】VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion
链接:https://arxiv.org/abs/2603.08020
作者:Jing Li,Jing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generating realistic cast, inserted foreground objects, complex scenes remains, scenes remains difficult, remains difficult due
备注: 12 pages,8 figures
点击查看摘要
Abstract:Generating realistic cast shadows for inserted foreground objects is a crucial yet challenging problem in image composition, where maintaining geometric consistency between shadow and object in complex scenes remains difficult due to the ill-posed nature of shadow formation. To address this issue, we propose VSDiffusion, a visibility-constrained two-stage framework designed to narrow the solution space by incorporating visibility priors. In Stage I, we predict a coarse shadow mask to localize plausible shadow regions. In Stage II, conditional diffusion is performed, guided by lighting and depth cues estimated from the composite, to generate accurate shadows. In VSDiffusion, we inject visibility priors through two complementary pathways: first, a visibility control branch with shadow-gated cross attention that provides multi-scale structural guidance; second, a learned soft prior map that reweights the training loss in error-prone regions to enhance geometric correction. Additionally, we introduce a high-frequency guided enhancement module to sharpen boundaries and improve texture interaction with the background. Experiments on the widely used public DESOBAv2 dataset demonstrate that VSDiffusion generates accurate shadows and establishes new SOTA results across most evaluation metrics.
94. 【2603.08018】Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
链接:https://arxiv.org/abs/2603.08018
作者:Yafei Zhang,Meng Ma,Huafeng Li,Yu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:perception and security, vital for perception, methods rely, Joint Shared-dictionary Representation, Shared-dictionary Representation Learning
备注: This paper has been accepted by CVPR 2026
点击查看摘要
Abstract:Infrared-visible (IR-VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This encode-transfer-fuse-reconstruct pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary-coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference-fusion to tackle missing-IR fusion. The source code is publicly available at this https URL.
95. 【2603.08011】It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models
链接:https://arxiv.org/abs/2603.08011
作者:Jaeha Choi,Jin Won Lee,Siwoo You,Jangho Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, Advances in vision-language, reading analog clocks, multimodal reasoning tasks, complex multimodal reasoning
备注: Accepted to CVPR 2026 Findings
点击查看摘要
Abstract:Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatial-temporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatial-temporal reasoning and visual understanding in VLMs.
96. 【2603.08007】ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
链接:https://arxiv.org/abs/2603.08007
作者:Haoyu Tong,Xiangyu Dong,Xiaoguang Ma,Haoran Zhao,Yaoming Zhou,Chenghao Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:textual scene graphs, converts open-vocabulary detections, discrete textual scene, Existing aerial Vision-Language, aerial Vision-Language Navigation
备注: 8 pages
点击查看摘要
Abstract:Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.
97. 【2603.07989】AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.07989
Authors: Teng Wang, Yanting Lu, Ruize Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex human behaviors, large language models, model complex human, inherent reasoning capabilities, human-populated environments
Comments:
Abstract:We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in human-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents waypoints with point tokens as categorical and positional markers while encoding waypoint numerical values as corresponding point embeddings, seamlessly integrated into the LLM's space through a lightweight encoder-decoder architecture. This design preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitating the modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.
98. 【2603.07988】TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Link: https://arxiv.org/abs/2603.07988
Authors: Stefan Lionar, Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multiagent Systems (cs.MA); Robotics (cs.RO)
Keywords: Physics-based humanoid control, achieved remarkable progress, high-performing single-agent behaviors, cooperative human-object interaction, human-object interaction
Comments: CVPR 2026. Project page: [this https URL](https://splionar.github.io/TeamHOI/). Code: [this https URL](https://github.com/sail-sg/TeamHOI)
Abstract:Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.
99. 【2603.07985】On the Feasibility and Opportunity of Autoregressive 3D Object Detection
Link: https://arxiv.org/abs/2603.07985
Authors: Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: detectors typically rely, non-maximum suppression, limiting extensibility, object detectors typically, typically rely
Comments: CVPR 2026 Findings. Project page: [this https URL](https://tzmhuang.github.io/autoreg3d/)
Abstract:LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry (near objects occlude far ones, but not vice versa), enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible with diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
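The near-to-far ordering and discrete-token encoding described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`quantize`, `boxes_to_tokens`), the bin count, the 64 m range, and the token layout are all assumptions made for illustration, and size and velocity are left out to keep the example short.

```python
import math

def quantize(value, lo, hi, bins):
    """Map a continuous value in [lo, hi] to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)

def boxes_to_tokens(boxes, bins=256, max_range=64.0):
    """Sort boxes by distance from the sensor (assumed at the origin),
    then flatten each box into a short run of discrete tokens."""
    ordered = sorted(boxes, key=lambda b: math.hypot(b["x"], b["y"]))
    tokens = []
    for b in ordered:
        tokens.extend([
            quantize(b["x"], -max_range, max_range, bins),  # center x
            quantize(b["y"], -max_range, max_range, bins),  # center y
            quantize(b["yaw"], -math.pi, math.pi, bins),    # orientation
            bins + b["cls"],  # class ids are placed after the coordinate bins
        ])
    return tokens

# Hypothetical input: the far box is listed first but is emitted second.
boxes = [
    {"x": 30.0, "y": 0.0, "yaw": 0.0, "cls": 1},
    {"x": -2.0, "y": 1.0, "yaw": 0.0, "cls": 0},
]
tokens = boxes_to_tokens(boxes)
```

Because the sequence is fully determined by the sorted ground-truth boxes, a target sequence like this could be fed to a decoder with teacher forcing, which is the property the abstract highlights.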
100. 【2603.07981】Extend Your Horizon: A Device-Agnostic Surgical Tool Tracking Framework with Multi-View Optimization for Augmented Reality
Link: https://arxiv.org/abs/2603.07981
Authors: Jiaming Zhang, Mingxu Liu, Hongchao Shu, Ruixing Liang, Yihao Liu, Ojas Taskar, Amir Kheradmand, Mehran Armand, Alejandro Martin-Gomez
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: relevant intraoperative information, visualize relevant intraoperative, intraoperative information, navigation provides real-time, real-time guidance
Comments: Accepted by IEEE VR 2026
Abstract:Surgical navigation provides real-time guidance by estimating the pose of patient anatomy and surgical instruments to visualize relevant intraoperative information. In conventional systems, instruments are typically tracked using fiducial markers and stationary optical tracking systems (OTS). Augmented reality (AR) has further enabled intuitive visualization and motivated tracking using sensors embedded in head-mounted displays (HMDs). However, most existing approaches rely on a clear line of sight, which is difficult to maintain in dynamic operating room environments due to frequent occlusions caused by equipment, surgical tools, and personnel. This work introduces a framework for tracking surgical instruments under occlusion by fusing multiple sensing modalities within a dynamic scene graph representation. The proposed approach integrates tracking systems with different accuracy levels and motion characteristics while estimating tracking reliability in real time. Experimental results demonstrate improved robustness and enhanced consistency of AR visualization in the presence of occlusions.
101. 【2603.07966】Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time
Link: https://arxiv.org/abs/2603.07966
Authors: Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: underspecified deictic commands, intentionally underspecified deictic, situated collaboration
Comments:
Abstract:In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video-audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% to 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech-gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
102. 【2603.07961】SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation
Link: https://arxiv.org/abs/2603.07961
Authors: Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Scene Graph Generation, structures visual scenes, Large Language Models, Multimodal Large Language, Graph Generation
Comments:
Abstract:Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), and that reasons in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
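To make "frequency-based adaptive weighting of predicates" concrete, here is a minimal sketch of one common way to do it: inverse-frequency weights with a smoothing exponent, used inside a triple-matching reward. The abstract does not give the paper's actual formula, so the exponent `alpha`, the mean normalization, and both function names are assumptions.

```python
from collections import Counter

def predicate_weights(predicates, alpha=0.5):
    """Give rare predicates larger weights: w_p is proportional to (1 / count_p) ** alpha,
    rescaled so that the mean weight over predicate types is 1."""
    counts = Counter(predicates)
    raw = {p: (1.0 / c) ** alpha for p, c in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {p: w / mean for p, w in raw.items()}

def weighted_relation_reward(predicted, gold, weights):
    """Weighted recall over (subject, predicate, object) triples."""
    gold_set = set(gold)
    hit = sum(weights.get(t[1], 1.0) for t in set(predicted) if t in gold_set)
    total = sum(weights.get(t[1], 1.0) for t in gold_set)
    return hit / total if total else 0.0

# Hypothetical training statistics: "on" is frequent, "riding" is rare.
weights = predicate_weights(["on"] * 16 + ["riding"])
gold = [("man", "riding", "horse"), ("horse", "on", "grass")]
predicted = [("man", "riding", "horse")]
reward = weighted_relation_reward(predicted, gold, weights)
```

With these numbers, recovering only the rare "riding" triple earns more reward than recovering only the frequent "on" triple would, which is the intended counterweight to a long-tailed predicate distribution.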
103. 【2603.07952】VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
Link: https://arxiv.org/abs/2603.07952
Authors: Yanning Hou, Peiyuan Li, Zirui Liu, Yitong Wang, Yanran Ruan, Jianfeng Qiu, Ke Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: target-class anomaly samples, requires detecting, detecting and localizing, localizing anomalies, anomalies without access
Comments: Accepted by CVPR 2026
Abstract:Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: this https URL
104. 【2603.07937】$L^3$: Scene-agnostic Visual Localization in the Wild
Link: https://arxiv.org/abs/2603.07937
Authors: Yu Zhang, Muhua Zhu, Yifei Xue, Tie Ji, Yizhen Lao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: methods typically require, Standard visual localization, localization methods typically, typically require offline, require offline pre-processing
Comments:
Abstract:Standard visual localization methods typically require offline pre-processing of scenes to obtain 3D structural information for better performance. This inevitably introduces additional computational and time costs, as well as the overhead of storing scene representations. Can we visually localize in a wild scene without any offline pre-processing step? In this paper, we leverage the online inference capabilities of feed-forward 3D reconstruction networks to propose a novel map-free visual localization framework, $L^3$. Specifically, by performing direct online 3D reconstruction on RGB images, followed by two-stage metric scale recovery and pose refinement based on 2D-3D correspondences, $L^3$ achieves high accuracy without the need to pre-build or store any offline scene representations. Extensive experiments demonstrate that $L^3$ not only performs comparably to state-of-the-art solutions on various benchmarks, but also exhibits significantly superior robustness in sparse scenes (fewer reference images per scene).
105. 【2603.07936】Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis
Link: https://arxiv.org/abs/2603.07936
Authors: Ethan Young, Zichun Wang, Aiden Taylor, Chance Jewell, Julian Myers, Satya Sri Rajiteswari Nimmagadda, Anthony White, Aniruddha Maiti, Ananya Jana
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diagrams, descriptions, teaching computer science, language model, teaching computer
Comments: Accepted to ASEE North Central Section 2026
Abstract:Diagrams are widely used in teaching computer science courses. They are especially useful in subjects such as automata and formal languages and data structures. These diagrams, often drawn by students during exams or assignments, vary in structure, layout, and correctness. This study examines whether current vision-language and large language models can process such diagrams and produce accurate textual and digital representations. Scanned student-drawn diagrams are used as input, and textual descriptions are generated from these images using a vision-language model. The descriptions are checked and revised by human reviewers to make them accurate. Both the generated and the revised descriptions are then fed to a large language model to generate TikZ code. The resulting diagrams are compiled and then evaluated against the original scanned diagrams. We find that descriptions generated directly from images by vision-language models are often incorrect, and that human correction substantially improves their quality. This research can help computer science education by paving the way for automated grading and feedback and by creating more accessible instructional materials.
106. 【2603.07929】A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
Link: https://arxiv.org/abs/2603.07929
Authors: Anh Duy Le, Van Linh Pham, Vinh Loi Ly, Nam Quan Nguyen, Huu Thang Nguyen, Tuan Anh Tran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: mathematical expression recognition, expression recognition, mathematical expression, Hybrid Vision Transformer, crucial challenges
Comments: Accepted as oral presentation at DICTA 2022
Abstract:One of the crucial challenges in document analysis is mathematical expression recognition. Unlike text recognition, which only deals with images of one-dimensional structure, mathematical expression recognition is a much more complicated problem because of its two-dimensional structure and varying symbol sizes. In this paper, we propose using a Hybrid Vision Transformer (HVT) with 2D positional encoding as the encoder to extract the complex relationships between symbols from the image. A coverage attention decoder is used to better track the attention history and handle the under-parsing and over-parsing problems. We also show the benefit of using the [CLS] token of ViT as the initial embedding of the decoder. Experiments performed on the IM2LATEX-100K dataset show the effectiveness of our method, which achieves a BLEU score of 89.94 and outperforms current state-of-the-art methods.
107. 【2603.07926】IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
Link: https://arxiv.org/abs/2603.07926
Authors: Sunghyun Baek (1), Jaemyung Yu (1), Seunghee Koh (1), Minsu Kim (2), Hyeonseong Jeon (2), Junmo Kim (1) ((1) Korea Advanced Institute of Science and Technology (KAIST), (2) LG Energy Solution)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: test data differ, prevent performance degradation, widely explored, explored to prevent, degradation when test
Comments: ICLR 2026
Abstract:Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at this https URL.
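The parameterization described in this abstract (decompose each linear layer with SVD, keep the singular vectors fixed, adapt only the singular values) can be written out as follows. Reading each rank-one term as one "spectral expert" is an interpretation of the abstract's wording, not something it states explicitly, and the symbols $m$, $n$, $r$, and $\sigma'$ are introduced here for illustration.

```latex
% Pretrained weight of one linear layer, W \in \mathbb{R}^{m \times n}, r = \min(m, n)
W = U \,\mathrm{diag}(\sigma)\, V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

% Adapted weight: U and V stay frozen, only the singular values are updated
W' = U \,\mathrm{diag}(\sigma')\, V^{\top} = \sum_{i=1}^{r} \sigma'_i \, u_i v_i^{\top}
```

Under this form, a layer contributes only $r = \min(m, n)$ trainable scalars instead of $m \cdot n$ weights, which is in line with the abstract's emphasis on minimal parameter updates.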
108. 【2603.07920】RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving
Link: https://arxiv.org/abs/2603.07920
Authors: Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Guangming Xiong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: necessitates reliable localization, All-weather autonomy, autonomous driving, diverse scenarios, autonomy is critical
Comments:
Abstract:All-weather autonomy is critical for autonomous driving, which necessitates reliable localization across diverse scenarios. While LiDAR place recognition is widely deployed for this task, its performance degrades in adverse weather. Conversely, radar-based methods, though weather-resilient, are hindered by the general unavailability of radar maps. To bridge this gap, radar-to-LiDAR place recognition, which localizes radar scans within existing LiDAR maps, has garnered increasing interest. However, extracting discriminative and generalizable features shared between modalities remains challenging, compounded by the scarcity of large-scale paired training data and the signal heterogeneity across radar types. In this work, we propose RLPR, a robust radar-to-LiDAR place recognition framework compatible with single-chip, scanning, and 4D radars. We first design a dual-stream network to extract structural features that abstract away from sensor-specific signal properties (e.g., Doppler or RCS). Subsequently, motivated by our task-specific asymmetry observation between radar and LiDAR, we introduce a two-stage asymmetric cross-modal alignment (TACMA) strategy, which leverages the pre-trained radar branch as a discriminative anchor to guide the alignment process. Experiments on four datasets demonstrate that RLPR achieves state-of-the-art recognition accuracy with strong zero-shot generalization capabilities.
109. 【2603.07918】Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
Link: https://arxiv.org/abs/2603.07918
Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unregistered hyperspectral image, hyperspectral image, typically aims, high-resolution reference image, Unregistered hyperspectral
Comments:
Abstract:Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at this https URL.
110. 【2603.07912】Geometric Transformation-Embedded Mamba for Learned Video Compression
Link: https://arxiv.org/abs/2603.07912
Authors: Hao Wei, Yanhui Zhou, Chenyang Ge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: exhibited outstanding performance, requires explicit motion, explicit motion estimation, learned video compression, hybrid coding paradigm
Comments:
Abstract:Although learned video compression methods have exhibited outstanding performance, most of them typically follow a hybrid coding paradigm that requires explicit motion estimation and compensation, resulting in a complex solution for video compression. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that effectively utilizes conditional temporal priors to accurately estimate the probability distributions of current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source codes and models will be available at this https URL.
111. 【2603.07911】Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Link: https://arxiv.org/abs/2603.07911
Authors: Hui Liu, Kecheng Chen, Jialiang Wang, Xianming Liu, Wenya Wang, Haoliang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-Language Models, significantly advanced zero-shot, significantly advanced, zero-shot image recognition, advanced zero-shot image
Comments: 19 pages, Accepted by CVPR 2026
Abstract:Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompts by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification. Our code is available at this https URL.
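The phrase "prediction as marginalization over the concept space" can be illustrated with a short sketch: each class score is a sum over that class's concepts of prior times image-conditioned likelihood. This is only a toy reading of the abstract. The exponential likelihood, the `temperature` value, the function name, and the example concepts are assumptions, and the paper's soft-trim likelihood and Determinantal Point Process step are not modeled.

```python
import math

def class_posteriors(similarities, priors, temperature=0.1):
    """Marginalize over concepts for each class.

    similarities[y][c]: image-to-concept similarity for concept c of class y
    priors[y][c]: prior weight of concept c within class y (sums to 1 per class)
    """
    scores = {}
    for y, sims in similarities.items():
        # Likelihood is assumed proportional to exp(similarity / temperature).
        scores[y] = sum(priors[y][c] * math.exp(s / temperature) for c, s in sims.items())
    z = sum(scores.values())
    return {y: v / z for y, v in scores.items()}

# Hypothetical similarities for one test image, with uniform concept priors.
similarities = {
    "cat": {"whiskers": 0.30, "pointed ears": 0.20},
    "dog": {"floppy ears": 0.10, "long snout": 0.10},
}
priors = {y: {c: 1.0 / len(cs) for c in cs} for y, cs in similarities.items()}
posteriors = class_posteriors(similarities, priors)
```

In this form, a single outlier concept with an unusually high similarity can dominate its class's sum, which gives some intuition for why the abstract introduces a likelihood that attenuates outliers.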
112. 【2603.07898】Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Link: https://arxiv.org/abs/2603.07898
Authors: Chen-Chen Zong, Yu-Qi Chi, Xie-Yang Wang, Yan Cui, Sheng-Jun Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Open-set active learning, previously unseen classes-a, unseen classes-a common, classes-a common challenge, Efficient Open-set Active
Comments: Accepted to CVPR 2026
Abstract:Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes, a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E$^2$OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E$^2$OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E$^2$OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications. The code is available at this http URL.
113. 【2603.07895】MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models
Link: https://arxiv.org/abs/2603.07895
Authors: Minsoo Lee, Jonghyun Kim, Juseung Yun, Sunwoo Yu, Jongseong Jang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale whole-slide images, foundation models learn, underlying molecular state, Pathology foundation models, models learn morphological
Comments:
Abstract:Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token, preventing catastrophic forgetting through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.
114. 【2603.07890】Visualizing Coalition Formation: From Hedonic Games to Image Segmentation
Link: https://arxiv.org/abs/2603.07890
Authors: Pedro Henrique de Paula França, Lucas Lopes Felipe, Daniel Sadoc Menasché
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual diagnostic testbed, hedonic games, visual diagnostic, diagnostic testbed, formation in hedonic
Comments: The First Workshop on AI for Mechanism Design and Strategic Decision Making (Workshop AIMS) at ICLR 2026
Abstract:We propose image segmentation as a visual diagnostic testbed for coalition formation in hedonic games. Modeling pixels as agents on a graph, we study how a granularization parameter shapes equilibrium fragmentation and boundary structure. On the Weizmann single-object benchmark, we relate multi-coalition equilibria to binary protocols by measuring whether the converged coalitions overlap with a foreground ground-truth. We observe transitions from cohesive to fragmented yet recoverable equilibria, and finally to intrinsic failure under excessive fragmentation. Our core contribution links multi-agent systems with image segmentation by quantifying the impact of mechanism design parameters on equilibrium structures.
115. 【2603.07889】Structure and Progress Aware Diffusion for Medical Image Segmentation
Link: https://arxiv.org/abs/2603.07889
Authors: Siyuan Song, Guyue Hu, Chenglong Li, Dengdi Sun, Zhe Jin, Jin Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical image segmentation, Medical image, carving fine boundaries, computer-aided diagnosis, crucial for computer-aided
Comments:
Click to view abstract
Abstract:Medical image segmentation is crucial for computer-aided diagnosis, which necessitates understanding both coarse morphological and semantic structures, as well as carving fine boundaries. The morphological and semantic structures in medical images are beneficial and stable clues for target understanding. In contrast, the fine boundaries of medical targets (like tumors and lesions) are usually ambiguous and noisy because of lesion overlap, annotation uncertainty, and so on, making them unreliable as early supervision. However, existing methods simultaneously learn coarse structures and fine boundaries throughout the training process. In this paper, we propose a structure and progress-aware diffusion (SPAD) for medical image segmentation, which consists of a semantic-concentrated diffusion (ScD) and a boundary-centralized diffusion (BcD) modulated by a progress-aware scheduler (PaS). Specifically, the semantic-concentrated diffusion introduces anchor-preserved target perturbation, which perturbs pixels within a medical target but preserves unaltered areas as semantic anchors, encouraging the model to infer noisy target areas from the surrounding semantic context. The boundary-centralized diffusion introduces progress-aware boundary noise, which blurs unreliable and ambiguous boundaries, thus compelling the model to focus on coarse but stable anatomical morphology and global semantics. Furthermore, the progress-aware scheduler gradually modulates the noise intensity of the ScD and BcD, forming a coarse-to-fine diffusion paradigm, which encourages focusing on coarse morphological and semantic structures during early target understanding stages and gradually shifting to fine target boundaries during later contour adjusting stages.
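The anchor-preserved target perturbation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the Gaussian noise, and the linear schedule in `noise_scale` are all assumptions made for illustration, since the abstract does not give the exact noise model or scheduler.

```python
import random

def noise_scale(progress, max_sigma=1.0):
    """Hypothetical linear schedule (an assumption, not from the paper):
    full noise at progress=0, no noise at progress=1."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return max_sigma * (1.0 - progress)

def perturb_target(image, mask, progress, rng=None):
    """Add Gaussian noise only where mask is truthy (the target region).

    Pixels outside the target are returned unchanged, so they can act
    as the 'semantic anchors' mentioned in the abstract.
    """
    rng = rng or random.Random(0)
    sigma = noise_scale(progress)
    return [
        [pix + sigma * rng.gauss(0.0, 1.0) if m else pix
         for pix, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

With a 2x2 image and a mask that marks only the top-left pixel, only that pixel changes; at `progress=1.0` the image is returned unchanged under this assumed schedule.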
116. 【2603.07888】VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
Link: https://arxiv.org/abs/2603.07888
Authors: Minkyu Kim, Sangheon Lee, Dongmin Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: visually similar images, industrial anomaly detection, distinguish subtle differences, anomaly detection, ability to distinguish
Comments: ICLR 2026
Click to view abstract
Abstract:The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action), and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
117. 【2603.07874】Toward Unified Multimodal Representation Learning for Autonomous Driving
Link: https://arxiv.org/abs/2603.07874
Authors: Ximeng Tao, Dimitar Filev, Gaurav Pandey
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Contrastive Language-Image Pre-training, shown impressive performance, textual representations, shown impressive, visual and textual
Comments:
Click to view abstract
Abstract:Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.
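To make the step from a 2D similarity matrix to a multimodal similarity tensor concrete, here is a minimal sketch for three modalities. The abstract does not define the tensor or the tensor loss, so the trilinear form in `similarity_tensor` and the cross-entropy in `tensor_contrastive_loss` are assumptions chosen for illustration, not the CTP formulation.

```python
import math

def similarity_tensor(text, image, points):
    """Assumed trilinear form:
    T[i][j][k] = sum_d text[i][d] * image[j][d] * points[k][d]."""
    return [[[sum(a * b * c for a, b, c in zip(t, v, p)) for p in points]
             for v in image] for t in text]

def tensor_contrastive_loss(tensor, temperature=0.1):
    """Hypothetical joint loss: for each text anchor i, the matched
    triplet (i, i, i) is the positive among all (j, k) combinations."""
    n = len(tensor)
    total = 0.0
    for i in range(n):
        logits = [tensor[i][j][k] / temperature
                  for j in range(n) for k in range(n)]
        m = max(logits)
        # Numerically stable log-sum-exp over all image/point-cloud pairs.
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - tensor[i][i][i] / temperature
    return total / n
```

With perfectly aligned one-hot embeddings the loss is close to zero, and it grows when one modality is shuffled, which is the behaviour a joint alignment objective should show.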
118. 【2603.07865】SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving
Link: https://arxiv.org/abs/2603.07865
Authors: Ayush Barik, Sofia Stoica, Nikhil Sarda, Arnav Kethana, Abhinav Khanduja, Muchen Xu, Fan Lai
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Keywords: models produce high-fidelity, diffusion models produce, incurring multi-second latency, produce high-fidelity audio, function evaluations
Comments: Submitted to INTERSPEECH 2026
Click to view abstract
Abstract:Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only $\sim$1K entries while preserving or improving perceptual quality.
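The retrieval and skip decisions described above can be sketched roughly as follows. Everything here is hypothetical: the cache entry layout, the thresholds `min_sim` and `max_dur_gap`, and the linear rule in `skip_fraction` are illustrative assumptions, not SoundWeaver's actual Reference Selector or Skip Gater.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_reference(query_emb, query_dur, cache, min_sim=0.8, max_dur_gap=2.0):
    """Return (entry, similarity) for the best cached item that passes
    both gates, or (None, 0.0) when nothing qualifies."""
    best, best_sim = None, 0.0
    for entry in cache:
        if abs(entry["duration"] - query_dur) > max_dur_gap:
            continue  # duration gate: cached clip length is too different
        sim = cosine(query_emb, entry["embedding"])
        if sim >= min_sim and sim > best_sim:  # semantic gate
            best, best_sim = entry, sim
    return best, best_sim

def skip_fraction(sim, min_sim=0.8, max_skip=0.6):
    """Assumed rule: skip a larger share of denoising steps as the
    similarity rises from min_sim to 1."""
    if sim < min_sim:
        return 0.0
    return max_skip * (sim - min_sim) / (1.0 - min_sim)
```

A cache miss returns `(None, 0.0)`, in which case a serving system would presumably fall back to full generation from noise.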
119. 【2603.07839】Training-free Temporal Object Tracking in Surgical Videos
Link: https://arxiv.org/abs/2603.07839
Authors: Subhadeep Koley, Abdolrahim Kadkhodamohammadi, Santiago Barbarisi, Danail Stoyanov, Imanol Luengo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: critical anatomical structures, laparoscopic cholecystectomy, structures and instruments, critical anatomical, anatomical structures
Comments: Accepted in IPCAI 2025
Click to view abstract
Abstract:Purpose: In this paper, we present a novel approach for online object tracking in laparoscopic cholecystectomy (LC) surgical videos, targeting localisation and tracking of critical anatomical structures and instruments. Our method addresses the challenges of costly pixel-level annotations and label inconsistencies inherent in existing datasets. Methods: Leveraging the inherent object localisation capabilities of pre-trained text-to-image diffusion models, we extract representative features from surgical frames without any training or fine-tuning. Our tracking framework uses these features, along with cross-frame interactions via an affinity matrix inspired by query-key-value attention, to ensure temporal continuity in the tracking process. Results: Through a pilot study, we first demonstrate that diffusion features exhibit superior object localisation and consistent semantics across different decoder levels and temporal frames. Later, we perform extensive experiments to validate the effectiveness of our approach, showcasing its superiority over competitors for the task of temporal object tracking. Specifically, we achieve a per-pixel classification accuracy of 79.19%, mean Jaccard Score of 56.20%, and mean F-Score of 79.48% on the publicly available CholeSeg8K dataset. Conclusion: Our work not only introduces a novel application of text-to-image diffusion models but also contributes to advancing the field of surgical video analysis, offering a promising avenue for accurate and cost-effective temporal object tracking in minimally invasive surgery videos.
Journal reference: Int J CARS 20, 1067-1075 (2025). Related DOI: https://doi.org/10.1007/s11548-025-03349-6
Submission history: v1, Sun, 8 Mar 2026 23:09:16 UTC, from Subhadeep Koley.
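The cross-frame affinity idea in the abstract above (an affinity matrix inspired by query-key-value attention) can be sketched as a simple label propagation step. This is a generic illustration under stated assumptions: features are plain vectors, labels are one-hot lists, and the `temperature` value is arbitrary. The paper's actual feature extraction from a diffusion model and its exact affinity formulation are not reproduced here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def propagate_labels(query_feats, key_feats, key_labels, temperature=0.1):
    """Propagate labels from a previous frame to the current frame.

    query_feats: feature vectors of current-frame locations (queries).
    key_feats:   feature vectors of previous-frame locations (keys).
    key_labels:  one-hot label vectors of previous-frame locations (values).
    Returns one soft label vector per query location.
    """
    n_classes = len(key_labels[0])
    out = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / temperature
                  for k in key_feats]
        weights = softmax(scores)  # one row of the affinity matrix
        out.append([sum(w * lab[c] for w, lab in zip(weights, key_labels))
                    for c in range(n_classes)])
    return out
```

Each current-frame location receives a soft label that is a weighted average of previous-frame labels, so locations with similar features inherit the same class.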
120. 【2603.07832】GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Link: https://arxiv.org/abs/2603.07832
Authors: Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern virtual reality, virtual reality, Gaze estimation, Gaze, modern virtual
Comments: Accepted to CVPR26
Click to view abstract
Abstract:Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at this https URL.
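For readers unfamiliar with how gaze errors in degrees are usually computed, the sketch below measures the angle between predicted and ground-truth 3D gaze direction vectors. It assumes the reported errors follow this common convention; the abstract itself does not spell out the metric, and the function names are illustrative.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def mean_angular_error_deg(preds, targets):
    """Average angular error over a set of predictions."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Identical directions give an error of 0 degrees and orthogonal directions give 90 degrees, regardless of vector length.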
121. 【2603.07831】Transferable Optimization Network for Cross-Domain Image Reconstruction
Link: https://arxiv.org/abs/2603.07831
Authors: Yunmei Chen, Chi Ding, Xiaojing Ye
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Keywords: tackle the challenge, image reconstruction problems, data, transfer learning framework, reconstruction problems
Comments: 30 pages, 7 figures
Click to view abstract
Abstract:We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems. The proposed framework consists of two training steps, both of which are formulated as bi-level optimizations. In the first step, we train a powerful universal feature-extractor that is capable of learning important knowledge from large, heterogeneous data sets in various domains. In the second step, we train a task-specific domain-adapter for a new target domain or task with only a limited amount of data available for training. The composition of the adapter and the universal feature-extractor then effectively explores features that serve as an important component of image regularization for the new domains, and this leads to high-quality reconstruction despite the data limitation issue. We apply this framework to reconstruct under-sampled MR images with limited data by using a collection of diverse data samples from different domains, such as images of other anatomies, measurements of various sampling ratios, and even different image modalities, including natural images. Experimental results demonstrate a promising transfer learning capability of the proposed method.
122. 【2603.07819】Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression
Link: https://arxiv.org/abs/2603.07819
Authors: Mridankan Mandal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: sustainable livestock management, real world monitoring, CSIRO Pasture Biomass, Accurate estimation, annotated datasets typical
Comments:
Click to view abstract
Abstract:Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real-world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357-image dual-view dataset with laboratory-validated, component-wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross-view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two-layer gated depthwise convolution (R^2 = 0.903) outperforms cross-view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no-fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2-to-DINOv3 upgrade alone yielding +5.0 R^2 points. Training-only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.
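Since the abstract compares configurations by R^2, a short reminder of the standard coefficient of determination may help. The sketch below implements the usual unweighted definition; whether the benchmark uses a weighted or per-target variant is not stated in the abstract, so treat this as the textbook formula only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0 and a predictor that always outputs the mean of the targets scores 0.0, so a gap such as 0.903 versus 0.833 is a difference of 7.0 "R^2 points" on this scale.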
123. 【2603.07817】Tracking Phenological Status and Ecological Interactions in a Hawaiian Cloud Forest Understory using Low-Cost Camera Traps and Visual Foundation Models
Link: https://arxiv.org/abs/2603.07817
Authors: Luke Meyers, Anirudh Potlapally, Yuyan Chen, Mike Long, Tanya Berger-Wolf, Hari Subramoni, Remi Megret, Daniel Rubenstein
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: wide ecological impacts, broadly understudied, Pu'u Maka'ala Natural, Maka'ala Natural Area, Natural Area Reserve
Comments:
Click to view abstract
Abstract:Plant phenology, the study of cyclical events such as leafing out, flowering, or fruiting, has wide ecological impacts but is broadly understudied, especially in the tropics. Image analysis has greatly enhanced remote phenological monitoring, yet capturing phenology at the individual level remains challenging. In this project, we deployed low-cost, animal-triggered camera traps at the Pu'u Maka'ala Natural Area Reserve in Hawaii to simultaneously document shifts in plant phenology and flora-faunal interactions. Using a combination of foundation vision models and traditional computer vision methods, we measure phenological trends from images comparable to on-the-ground observations without relying on supervised learning techniques. These temporally fine-grained phenology measurements from camera-trap images uncover trends that coarser traditional sampling fails to detect. When combined with detailed visitation data detected from images, these trends can begin to elucidate drivers of both plant phenology and animal ecology.
124. 【2603.07815】HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration
Link: https://arxiv.org/abs/2603.07815
Authors: Desen Sun, Jason Hon, Jintao Zhang, Sihang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: demonstrated a remarkable, remarkable ability, model, large model, generation applications
Comments:
Click to view abstract
Abstract:Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation applications. Despite the advanced generation output, they suffer from heavy computation overhead, especially for large models that contain tens of billions of parameters. Prior work has illustrated that replacing part of the denoising steps with a smaller model still maintains the generation quality. However, these methods only focus on saving computation for some timesteps, ignoring the difference in compute demand within one timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that jointly incorporates both the large model and the small model. HybridStitch separates the entire image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while exploiting the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves 1.83$\times$ speedup on Stable Diffusion 3, which is faster than all existing mixture of model methods.
125. 【2603.07799】MWM: Mobile World Models for Action-Conditioned Consistent Prediction
Link: https://arxiv.org/abs/2603.07799
Authors: Han Yan, Zishang Xiang, Zeyu Zhang, Hao Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: future predicted space, imagined future predicted, World models enable, predicted space, offering a promising
Comments:
Click to view abstract
Abstract:World models enable planning in imagined future predicted space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: this https URL. Website: this https URL.
126. 【2603.07794】4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera
Link: https://arxiv.org/abs/2603.07794
Authors: David Ninfa, Andras Palffy, Holger Caesar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: driving requires robust, occupancy prediction remains, Autonomous driving requires, semantic occupancy prediction, weather and lighting
Comments:
Click to view abstract
Abstract:Autonomous driving requires robust perception across diverse environmental conditions, yet 3D semantic occupancy prediction remains challenging under adverse weather and lighting. In this work, we present the first study combining 4D radar and camera data for 3D semantic occupancy prediction. Our fusion leverages the complementary strengths of both modalities: 4D radar provides reliable range, velocity, and angle measurements in challenging conditions, while cameras contribute rich semantic and texture information. We further show that integrating depth cues from camera pixels enables lifting 2D images to 3D, improving scene reconstruction accuracy. Additionally, we introduce a fully automatically labeled dataset for training semantic occupancy models, substantially reducing reliance on costly manual annotation. Experiments demonstrate the robustness of 4D radar across diverse scenarios, highlighting its potential to advance autonomous vehicle perception.
127. 【2603.07789】SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
Link: https://arxiv.org/abs/2603.07789
Authors: Zixuan Pan, Kaiyuan Tang, Jun Xia, Yifan Qin, Lin Gu, Chaoli Wang, Jianxu Chen, Yiyu Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Splatting has emerged, Gaussian Splatting, support efficient rendering, low-end devices, rendering on low-end
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5x compression over prior non-quantized 2D Gaussian methods and 1.6x over quantized ones, while also delivering 1.6x and 6.5x faster optimization, respectively, without degrading, and often improving, image fidelity. Code is available at this https URL.
128. 【2603.07786】OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models
Link: https://arxiv.org/abs/2603.07786
Authors: Yusuke Tozaki, Hisashi Miyamori
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: show clear gaps, track relative positions, ordinal number understanding, number understanding, large indices
Comments: Accepted as a Short Paper at VISAPP 2026
Click to view abstract
Abstract:Vision-Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding, i.e., the ability to track relative positions and generalize to large indices. We present OrdinalBench, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is N-th object identification, defined by a starting reference and traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude, from small numbers to extreme cases up to 300; (ii) arrangement complexity, from single loops to maze-like paths; and (iii) object count. The benchmark provides 39,000 question-answer pairs, each annotated with a ground-truth reasoning trajectory and balanced across difficulty levels for controlled large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured stepwise traces of the counting process and provides an open evaluation toolkit that measures both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core target, OrdinalBench provides a reproducible benchmark and diagnostic framework for developing VLMs with stronger sequential reasoning. All data and code are available at this https URL
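The core task (N-th object identification from a starting reference and a traversal rule) and the idea of step-level path consistency can be illustrated with a toy sketch. The conventions here are assumptions: objects lie on a single closed loop, the starting object counts as the 1st, and consistency is the fraction of positions where the predicted trace matches the reference. The benchmark's actual toolkit may define these differently.

```python
def counting_trace(order, start, n):
    """Objects visited while counting to n along a closed loop.

    order: object ids listed in traversal order.
    start: the reference object, counted as the 1st (an assumption).
    """
    i = order.index(start)
    return [order[(i + step) % len(order)] for step in range(n)]

def nth_object(order, start, n):
    """The n-th object is the last element of the counting trace."""
    return counting_trace(order, start, n)[-1]

def step_consistency(predicted, reference):
    """Fraction of reference steps that the predicted trace reproduces."""
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)
```

For a loop `["A", "B", "C", "D"]` starting at `"B"`, the 3rd object is `"D"` and the 6th wraps around to `"C"`, which shows why large ordinals force a model to keep an exact running count.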
129. 【2603.07776】Parameterized Brushstroke Style Transfer
Link: https://arxiv.org/abs/2603.07776
Authors: Uma Meleti, Siyu Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Computer Vision-based Style, Computer Vision-based, Vision-based Style Transfer, Style Transfer techniques, Vision-based Style
Comments:
Click to view abstract
Abstract:Computer Vision-based Style Transfer techniques have been used for many years to represent artistic style. However, most contemporary methods have been restricted to the pixel domain; in other words, they modify the image pixels to incorporate artistic style. Real artistic work, by contrast, is made of brush strokes with different colors on a canvas, so pixel-based approaches are unnatural for representing these images. Hence, this paper discusses a style transfer method that represents the image in the brush stroke domain instead of the RGB domain, which yields visually better results than pixel-based methods.
130. 【2603.07774】Geometric Knowledge-Assisted Federated Dual Knowledge Distillation Approach Towards Remote Sensing Satellite Imagery
Link: https://arxiv.org/abs/2603.07774
Authors: Luyao Zou, Fei Pan, Jueying Li, Yan Kyaw Tun, Apurba Adhikary, Zhu Han, Hayoung Oh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: analyzing remote sensing, sensing satellite imagery, remote sensing satellite, Federated Dual Knowledge, Federated learning
Comments: 16 pages, 9 figures
Click to view abstract
Abstract:Federated learning (FL) has recently become a promising solution for analyzing remote sensing satellite imagery (RSSI). However, the large scale and inherent data heterogeneity of images collected from multiple satellites, where the local data distribution of each satellite differs from the global one, present significant challenges to effective model training. To address this issue, we propose a Geometric Knowledge-Guided Federated Dual Knowledge Distillation (GK-FedDKD) framework for RSSI analysis. In our approach, each local client first distills a teacher encoder (TE) from multiple student encoders (SEs) trained with unlabeled augmented data. The TE is then connected with a shared classifier to form a teacher network (TN) that supervises the training of a new student network (SN). The intermediate representations of the TN are used to compute local covariance matrices, which are aggregated at the server to generate global geometric knowledge (GGK). This GGK is subsequently employed for local embedding augmentation to further guide SN training. We also design a novel loss function and a multi-prototype generation pipeline to stabilize the training process. Evaluation over multiple datasets shows that the proposed GK-FedDKD approach is superior to the considered state-of-the-art baselines, e.g., the proposed approach with the Swin-T backbone surpasses previous SOTA approaches by an average of 68.89% on the EuroSAT dataset.
131. 【2603.07769】MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations
Link: https://arxiv.org/abs/2603.07769
Authors: Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, images inevitably suffer, face critical challenges, medical images inevitably, real-world clinical environments
Comments: 29 pages, 11 figures
Click to view abstract
Abstract:Despite impressive performance on standard benchmarks, multimodal large language models (MLLMs) face critical challenges in real-world clinical environments where medical images inevitably suffer various quality degradations. Existing benchmarks exhibit two key limitations: (1) absence of large-scale, multidimensional assessment across medical image quality gradients and (2) no systematic confidence calibration analysis. To address these gaps, we present MedQ-Deg, a comprehensive benchmark for evaluating medical MLLMs under image quality degradations. MedQ-Deg provides multi-dimensional evaluation spanning 18 distinct degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities, with 24,894 question-answer pairs. Each degradation is implemented at 3 severity degrees, calibrated by expert radiologists. We further introduce the Calibration Shift metric, which quantifies the gap between a model's perceived confidence and actual performance to assess metacognitive reliability under degradation. Our comprehensive evaluation of 40 mainstream MLLMs reveals several critical findings: (1) overall model performance degrades systematically as degradation severity increases, (2) models universally exhibit the AI Dunning-Kruger Effect, maintaining inappropriately high confidence despite severe accuracy collapse, and (3) models display markedly differentiated behavioral patterns across capability dimensions, imaging modalities, and degradation types. We hope MedQ-Deg drives progress toward medical MLLMs that are robust and trustworthy in real clinical practice.
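The abstract describes Calibration Shift only as the gap between a model's perceived confidence and its actual performance under degradation. One way to picture such a quantity is sketched below. The definition used here (mean confidence minus accuracy, compared between clean and degraded inputs) is a hypothetical stand-in, not the metric defined in the paper.

```python
def confidence_gap(confidences, correct):
    """Mean stated confidence minus empirical accuracy.

    confidences: per-question confidence values in [0, 1].
    correct:     per-question booleans (True if the answer was right).
    """
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean_conf - accuracy

def calibration_shift_sketch(clean, degraded):
    """Hypothetical shift: how much the confidence gap widens when
    moving from clean to degraded images. Each argument is a
    (confidences, correct) pair."""
    return confidence_gap(*degraded) - confidence_gap(*clean)
```

A model that keeps reporting 0.9 confidence while its accuracy drops from 0.75 to 0.25 gets a shift of 0.5 under this assumed definition, which matches the overconfidence pattern the abstract calls the AI Dunning-Kruger Effect.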
132. 【2603.07759】DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising
Link: https://arxiv.org/abs/2603.07759
Authors: Yinchi Zhou, Liang Guo, Huidong Xie, Yuexi Du, Ashley Wang, Menghua Xia, Tian Yu, Ramesh Fazzone-Chettiar, Christopher Weyman, Bruce Spottiswoode, Vladimir Panin, Kuangyu Shi, Edward J. Miller, Attila Feher, Albert J. Sinusas, Nicha C. Dvornek, Chi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: coronary artery disease, short half-life results, cardiac PET imaging, high noise levels, dynamic cardiac PET
Comments:
Click to view abstract
Abstract:Rb-82 dynamic cardiac PET imaging is widely used for the clinical diagnosis of coronary artery disease (CAD), but its short half-life results in high noise levels that degrade dynamic frame quality and parametric imaging. The lack of paired clean-noisy training data, rapid tracer kinetics, and frame-dependent noise variations further limit the effectiveness of existing deep learning denoising methods. We propose DECADE (A Temporally-Consistent Unsupervised Diffusion model for Enhanced Rb-82 CArdiac PET DEnoising), an unsupervised diffusion framework that generalizes across early- to late-phase dynamic frames. DECADE incorporates temporal consistency during both training and iterative sampling, using noisy frames as guidance to preserve quantitative accuracy. The method was trained and evaluated on datasets acquired from Siemens Vision 450 and Siemens Biograph Vision Quadra scanners. On the Vision 450 dataset, DECADE consistently produced high-quality dynamic and parametric images with reduced noise while preserving myocardial blood flow (MBF) and myocardial flow reserve (MFR). On the Quadra dataset, using 15%-count images as input and full-count images as reference, DECADE outperformed UNet-based and other diffusion models in image quality and K1/MBF quantification. The proposed framework enables effective unsupervised denoising of Rb-82 dynamic cardiac PET without paired training data, supporting clearer visualization while maintaining quantitative integrity.
133. 【2603.07758】AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
Link: https://arxiv.org/abs/2603.07758
Authors: Teng Yan,Yihan Liu,Jiongxu Chen,Teng Wang,Jiaqi Li,Bingzhuo Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: framewise referring pipelines, referring pipelines drift, Long-term language-guided referring, videos is challenging, drift as re-identification
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and -24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
134. 【2603.07751】3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
Link: https://arxiv.org/abs/2603.07751
Authors: Shaoxiong Zhan,Yanlin Lai,Zheng Liu,Hai Lin,Shen Li,Xiaodong Cai,Zijian Lin,Wen Huang,Hai-Tao Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Current Large Language, Large Language Models, achieved Olympiad-level logic, Current Large, Large Language
Comments:
Click to view abstract
Abstract:Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
135. 【2603.07704】PARSE: Part-Aware Relational Spatial Modeling
Link: https://arxiv.org/abs/2603.07704
Authors: Yinuo Bai,Peijun Xu,Kuixiang Shao,Yuyang Jiao,Jingxuan Zhang,Kaixin Yao,Jiayuan Gu,Jingyi Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Inter-object relations underpin, underpin spatial intelligence, Part-centric Assembly Graph, Inter-object relations, existing representations
Comments:
Click to view abstract
Abstract:Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
136. 【2603.07700】TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
Link: https://arxiv.org/abs/2603.07700
Authors: Yihong Luo,Tianyang Hu,Weijian Luo,Jing Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: significantly lower cost, enabled powerful image, Trajectory Distribution Matching, lower cost, unsolved problem
Comments: [this https URL](https://luo-yihong.github.io/TDM-R1-Page/)
Click to view abstract
Abstract:While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://luo-yihong.github.io/TDM-R1-Page/
137. 【2603.07697】Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation
Link: https://arxiv.org/abs/2603.07697
Authors: Junkun Jiang,Jie Chen,Ho Yin Au,Jingyu Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-based motion capture, critical joint information, motion capture solutions, Vision-based motion, struggle with occlusions
Comments: Accepted by IEEE Transactions on Multimedia. Supplementary material is included
Click to view abstract
Abstract:Vision-based motion capture solutions often struggle with occlusions, which result in the loss of critical joint information and hinder accurate 3D motion reconstruction. Other wearable alternatives also suffer from noisy or unstable data, often requiring extensive manual cleaning and correction to achieve reliable results. To address these challenges, we introduce the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. Central to our design is the Kinematic Attention Aggregation (KAA) mechanism, which enables efficient, deep, and iterative encoding of both joint-level and pose-level features, capturing structural and temporal motion patterns essential for task-specific reconstruction. We focus on learning context-adaptive motion priors, specialized structural and temporal features extracted by the same reusable architecture, where each learned prior emphasizes different aspects of motion dynamics and is specifically efficient for its corresponding task. This enables the architecture to adaptively specialize without altering its structure. Such versatility allows MMDM to efficiently learn motion priors tailored to scenarios such as motion refinement, completion, and in-betweening. Extensive evaluations on public benchmarks demonstrate that MMDM achieves strong performance across diverse masking strategies and task settings. The source code is available at this https URL.
138. 【2603.07694】Compressed-Domain-Aware Online Video Super-Resolution
Link: https://arxiv.org/abs/2603.07694
Authors: Yuhang Wang,Hai Li,Shujuan Hou,Zhetao Dong,Xiaoyao Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: online video streaming, bandwidth-limited online video, online VSR, downsampled and compressed, online video super-resolution
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at this https URL.
139. 【2603.07691】RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation
Link: https://arxiv.org/abs/2603.07691
Authors: Zhanqi Xiao,Ruiping Wang,Xilin Chen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Understanding spatial affordances, accomplish diverse tasks, Understanding spatial, contact regions, effectively manipulate objects
Comments: Accepted to ICRA 2026
Click to view abstract
Abstract:Understanding spatial affordances -- comprising the contact regions of object interaction and the corresponding contact poses -- is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.
140. 【2603.07690】FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT
Link: https://arxiv.org/abs/2603.07690
Authors: Zhisong Xu,Takeshi Oishi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Geometry Transformers, enable strong online, unbounded KV-cache growth, Streaming Visual Geometry, StreamVGGT enable strong
Comments: 24 pages including appendix
Click to view abstract
Abstract:Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception but suffer from unbounded KV-cache growth, which limits deployment over long streams. We revisit bounded-memory streaming from the perspective of geometric support. In geometry-driven reasoning, memory quality depends not only on how many tokens are retained, but also on whether the retained memory still preserves sufficiently coherent local support. This suggests that token-level retention may become less suitable under fixed budgets, as it can thin the evidence available within each contributing frame and make subsequent fusion more sensitive to weakly aligned history. Motivated by this observation, we propose FrameVGGT, a frame-driven rolling explicit-memory framework that treats each frame's incremental KV contribution as a coherent evidence block. FrameVGGT summarizes each block into a compact prototype and maintains a fixed-capacity mid-term bank of complementary frame blocks under strict budgets, with an optional lightweight anchor tier for rare prolonged degradation. Across long-sequence 3D reconstruction, video depth estimation, and camera pose benchmarks, FrameVGGT achieves favorable accuracy--memory trade-offs under bounded memory, while maintaining more stable geometry over long streams.
141. 【2603.07686】UniUncer: Unified Dynamic Static Uncertainty for End to End Driving
Link: https://arxiv.org/abs/2603.07686
Authors: Yu Gao,Jijun Wang,Zongzheng Zhang,Anqing Jiang,Yiru Wang,Yuwen Heng,Shuo Wang,Hao Sun,Zhangfeng Hu,Hao Zhao
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: single learnable pipeline, avoiding hand-engineered modules, academic research, offering a single, industry deployment
Comments: ICRA 2026
Click to view abstract
Abstract:End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only ~0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8% and yields notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
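Note: item (1) of the abstract above describes Laplace regressors that predict a location and a scale per vertex. The following is a minimal sketch of the standard Laplace negative log-likelihood such a head is commonly trained with, assuming an independent Laplace per coordinate. The function names `laplace_nll` and `polyline_nll` are hypothetical and this is not the authors' implementation.

```python
import math


def laplace_nll(target, loc, scale):
    """Negative log-likelihood of `target` under Laplace(loc, scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    # -log p(y) = log(2b) + |y - mu| / b for a Laplace density
    return math.log(2.0 * scale) + abs(target - loc) / scale


def polyline_nll(targets, locs, scales):
    """Mean per-coordinate NLL over a polyline given as (x, y) vertices.

    Assumption: each coordinate of each vertex has its own predicted
    location and scale, and coordinates are treated as independent.
    """
    terms = [
        laplace_nll(t, m, b)
        for tv, mv, bv in zip(targets, locs, scales)
        for t, m, b in zip(tv, mv, bv)
    ]
    return sum(terms) / len(terms)
```

Under this loss a larger predicted scale softens the penalty on a large residual but pays the `log(2b)` term, which is what lets the scale act as a learned uncertainty estimate.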
142. 【2603.07667】FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
Link: https://arxiv.org/abs/2603.07667
Authors: Congcong Bian,Haolong Ma,Hui Li,Zhongwei Shen,Xiaoqing Luo,Xiaoning Song,Xiao-Jun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Spatial registration, real-world perception, multi-modality image fusion, critical but formidable, formidable step
Comments:
Click to view abstract
Abstract:Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods are proposed to address this issue, the existing registration-based fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors is proposed for infrared and visible image fusion task, termed FusionRegister. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by serving the backbone fusion method as a natural visual prior provider, which guides the registration process to focus only on mismatch regions, thereby avoiding redundant operations. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment and robustness, making it highly suitable for infrared and visible image fusion method. The code will be available at this https URL.
143. 【2603.07664】Ref-DGS: Reflective Dual Gaussian Splatting
Link: https://arxiv.org/abs/2603.07664
Authors: Ningjing Fan,Yiqun Wang,Dongming Yan,Peter Wonka
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Keywords: typically near-field specular, near-field specular reflections, accurate surface reconstruction, poses a fundamental, view synthesis
Comments: Project page: [this https URL](https://straybirdflower.github.io/Ref-DGS/)
Click to view abstract
Abstract:Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware adaptive mixing shader that fuses global and local reflection features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.
144. 【2603.07660】Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
Link: https://arxiv.org/abs/2603.07660
Authors: Yuanyuan Gao,Hao Li,Yifei Liu,Xinhao Ji,Yuning Gong,Yuanjun Liao,Fangfu Liu,Manyuan Zhang,Yuchen Yang,Dan Xu,Xue Yang,Huaxi Huang,Hongjie Zhang,Ziwei Liu,Xiao Sun,Dingwen Zhang,Zhihang Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intelligence fundamentally relies, spatial intelligence fundamentally, intelligence fundamentally, fundamentally relies, relies on access
Comments: Project page: [this https URL](https://visionary-laboratory.github.io/holi-spatial/)
Click to view abstract
Abstract:The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially-aware multimodal dataset, constructed from raw video inputs without human intervention, using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial Question-Answer (QA) pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional performance in data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks using this dataset has also led to substantial improvements in model performance.
145. 【2603.07659】Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
Link: https://arxiv.org/abs/2603.07659
Authors: Kaihua Tang,Jiaxin Qi,Jinli Ou,Yuhua Zheng,Jianqiang Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, Large Language Models, driven rapid progress, emergence of Large, development of Large
Comments: Accepted to CVPR 2026. Code: [this https URL](https://github.com/KaihuaTang/Self-Critical-Inference-Framework)
Click to view abstract
Abstract:The emergence of Large Language Models (LLMs) has driven rapid progress in multi-modal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms place excessive reliance on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. Moreover, we also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not be able to capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.
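Note: the abstract above builds on Visual Contrastive Decoding, which contrasts next-token logits from the original input against logits from a perturbed input. A minimal sketch of that single-step contrast follows, using the commonly cited form `(1 + alpha) * original - alpha * perturbed`. The `multi_round` helper, which simply averages the contrast over several perturbed rounds, is a hypothetical illustration of "scaling the number of counterfactual rounds" and is not taken from the paper.

```python
import math


def contrastive_logits(original, perturbed, alpha=1.0):
    """Single-step contrast: amplify what the perturbed input fails to support."""
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_round(original, perturbed_rounds, alpha=1.0):
    """Hypothetical multi-round variant: average the contrast over all rounds."""
    per_round = [contrastive_logits(original, p, alpha) for p in perturbed_rounds]
    n = len(per_round)
    return [sum(vals) / n for vals in zip(*per_round)]


# Toy two-token vocabulary; each round is one textual or visual perturbation.
probs = softmax(multi_round([1.0, 0.0], [[1.0, 0.0], [0.0, 0.0]], alpha=1.0))
```

A token whose logit stays high even when the image or prompt is perturbed is likely driven by language priors, so the contrast lowers its relative score.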
146. 【2603.07652】GLASS: Graph and Vision-Language Assisted Semantic Shape Correspondence
Link: https://arxiv.org/abs/2603.07652
Authors: Qinfeng Xiao,Guofeng Mei,Qilong Liu,Chenyuan Yi,Fabio Poiesi,Jian Zhang,Bo Yang,Yick Kit-lun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Establishing dense correspondence, including texture transfer, Establishing dense, fundamental downstream tasks, shape interpolation
Comments:
Click to view abstract
Abstract:Establishing dense correspondence across 3D shapes is crucial for fundamental downstream tasks, including texture transfer, shape interpolation, and robotic manipulation. However, learning these mappings without manual supervision remains a formidable challenge, particularly under severe non-isometric deformations and in inter-class settings where geometric cues are ambiguous. Conventional functional map methods, while elegant, typically struggle in these regimes due to their reliance on isometry. To address this, we present GLASS, a framework that bridges the gap by integrating geometric spectral analysis with rich semantic priors from vision-language foundation models. GLASS introduces three key innovations: (i) a view-consistent strategy that enables robust multi-view visual feature extraction from powerful vision foundation models; (ii) the injection of language embeddings into vertex descriptors via zero-shot 3D segmentation, capturing high-level part semantics; and (iii) a graph-assisted contrastive loss that enforces structural consistency between regions (e.g., source's "head" ↔ target's "head") by leveraging geodesic and topological relationships between regions. This design allows GLASS to learn globally coherent and semantically consistent maps without ground-truth supervision. Extensive experiments demonstrate that GLASS achieves state-of-the-art performance across all regimes, maintaining high accuracy on standard near-isometric tasks while significantly advancing performance in challenging settings. Specifically, it achieves average geodesic errors of 0.21, 4.5, and 5.6 on the inter-class benchmark SNIS and non-isometric benchmarks SMAL and TOPKIDS, reducing errors from URSSM baselines of 0.49, 6.0, and 8.9 by 57%, 25%, and 37%, respectively.
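Note: the relative error reductions quoted at the end of the abstract above follow directly from the reported geodesic errors. A short check, using only the numbers stated in the abstract (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` reduces the `baseline` error."""
    return 100.0 * (baseline - ours) / baseline


# (URSSM baseline error, GLASS error) as reported in the abstract
reported = {"SNIS": (0.49, 0.21), "SMAL": (6.0, 4.5), "TOPKIDS": (8.9, 5.6)}

reductions = {
    name: round(relative_reduction(baseline, ours))
    for name, (baseline, ours) in reported.items()
}
print(reductions)
```

The rounded values come out to 57, 25, and 37 percent, which agrees with the figures in the abstract.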
147. 【2603.07648】AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Link: https://arxiv.org/abs/2603.07648
Authors: Likui Zhang,Tao Tang,Zhihao Zhan,Xiuwei Chen,Zisheng Chen,Jianhua Han,Jiangtong Zhu,Pei Xu,Hang Xu,Hefeng Wu,Liang Lin,Xiaodan Liang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown promising potential, Recent advances, shown promising, promising potential, robotic manipulation tasks
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $\pi_{0}$ by 2.4% on LIBERO, 10% on LIBERO-LONG, and outperforms $\pi_{0}$ and $\pi_{0.5}$ by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3% and 21% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is available at this https URL.
148. 【2603.07645】Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics
Link: https://arxiv.org/abs/2603.07645
Authors: Abdeldjalil Taibi,Mohmoud Badlis,Amina Bensalem,Belkacem Zouilekh,Mohammed Brahimi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Efficient luggage trolley, ensuring asset availability, Efficient luggage, luggage trolley management, management is critical
Comments:
Click to view abstract
Abstract:Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict security and privacy regulations limit large-scale data collection. Second, existing public datasets lack the diversity, scale, and annotation quality needed to handle dense, overlapping trolley arrangements typical of real-world operations. To address these limitations, we introduce a synthetic data generation pipeline based on a high-fidelity Digital Twin of Algiers International Airport using NVIDIA Omniverse. The pipeline produces richly annotated data with oriented bounding boxes, capturing complex trolley formations, including tightly nested chains. We evaluate YOLO-OBB using five training strategies: real-only, synthetic-only, linear probing, full fine-tuning, and mixed training. This allows us to assess how synthetic data can complement limited real-world annotations. Our results show that mixed training with synthetic data and only 40 percent of real annotations matches or exceeds the full real-data baseline, achieving 0.94 mAP@50 and 0.77 mAP@50-95, while reducing annotation effort by 25 to 35 percent. Multi-seed experiments confirm strong reproducibility with a standard deviation below 0.01 on mAP@50, demonstrating the practical effectiveness of synthetic data for automated trolley detection.
149. 【2603.07630】Real-Time Glottis Detection Framework via Spatial-decoupled Feature Learning for Nasal Transnasal Intubation
链接:https://arxiv.org/abs/2603.07630
作者:Jinyu Liu,Gaoyang Zhang,Yang Zhou,Ruoyi Hao,Yang Zhang,Hongliang Ren
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ensure patient safety, Nasotracheal intubation, emergency airway management, accurate glottis detection, airway management
备注: 15 pages, 7 figures
点击查看摘要
Abstract:Nasotracheal intubation (NTI) is a vital procedure in emergency airway management, where rapid and accurate glottis detection is essential to ensure patient safety. However, existing machine assisted visual detection systems often rely on high performance computational resources and suffer from significant inference delays, which limits their applicability in time critical and resource constrained scenarios. To overcome these limitations, we propose Mobile GlottisNet, a lightweight and efficient glottis detection framework designed for real time inference on embedded and edge devices. The model incorporates structural awareness and spatial alignment mechanisms, enabling robust glottis localization under complex anatomical and visual conditions. We implement a hierarchical dynamic thresholding strategy to enhance sample assignment, and introduce an adaptive feature decoupling module based on deformable convolution to support dynamic spatial reconstruction. A cross layer dynamic weighting scheme further facilitates the fusion of semantic and detail features across multiple scales. Experimental results demonstrate that the model, with a size of only 5MB on both our PID dataset and Clinical datasets, achieves inference speeds of over 62 FPS on devices and 33 FPS on edge platforms, showing great potential in the application of emergency NTI.
150. 【2603.07625】Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
链接:https://arxiv.org/abs/2603.07625
作者:Shumeng Li,Jintao Guo,Jian Zhang,Yulin Zhou,Luyang Cao,Yinghuan Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:practical brain-computer interfaces, reconstruct visual experiences, Cross-subject visual decoding, visual decoding aims, brain-computer interfaces
备注:
点击查看摘要
Abstract:Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Duala effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction. Our code is available at this https URL.
151. 【2603.07619】Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
链接:https://arxiv.org/abs/2603.07619
作者:Abin Shoby,Ta Duc Huy,Tuan Dung Nguyen,Minh Khoi Ho,Qi Chen,Anton van den Hengel,Phi Le Nguyen,Johan W. Verjans,Vu Minh Hieu Phan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language models, Vision Language, hallucinate non-existent objects, Language models, hallucinate non-existent
备注: CVPR2026 Findings
点击查看摘要
Abstract:Vision Language models (VLMs) often hallucinate non-existent objects. Detecting hallucination is analogous to detecting deception: a single final statement is insufficient, one must examine the underlying reasoning process. Yet existing detectors rely mostly on final-layer signals. Attention-based methods assume hallucinated tokens exhibit low attention, while entropy-based ones use final-step uncertainty. Our analysis reveals the opposite: hallucinated objects can exhibit peaked attention due to contextual priors; and models often express high confidence because intermediate layers have already converged to an incorrect hypothesis. We show that the key to hallucination detection lies within the model's thought process, not its final output. By probing decoder layers, we uncover a previously overlooked behavior, overthinking: models repeatedly revise object hypotheses across layers before committing to an incorrect answer. Once the model latches onto a confounded hypothesis, it can propagate through subsequent layers, ultimately causing hallucination. To capture this behavior, we introduce the Overthinking Score, a metric to measure how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. This score significantly improves hallucination detection: 78.9% F1 on MSCOCO and 71.58% on AMBER.
152. 【2603.07615】Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models
链接:https://arxiv.org/abs/2603.07615
作者:Jiajun He,Zongyu Guo,Zhaoyang Jia,Xiaoyi Zhang,Jiahao Li,Xiao Li,Bin Li,José Miguel Hernández-Lobato,Yan Lu
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Modern visual generative, models acquire rich, generative models acquire, rich visual knowledge, acquire rich visual
备注:
点击查看摘要
Abstract:Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
153. 【2603.07614】Looking Into the Water by Unsupervised Learning of the Surface Shape
链接:https://arxiv.org/abs/2603.07614
作者:Ori Lifschitz,Tali Treibitz,Dan Rosenbaum
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remove image distortions, image distortions caused, water surface, address the problem, seek to remove
备注:
点击查看摘要
Abstract:We address the problem of looking into the water from the air, where we seek to remove image distortions caused by refractions at the water surface. Our approach is based on modeling the different water surface structures at various points in time, assuming the underlying image is constant. To this end, we propose a model that consists of two neural-field networks. The first network predicts the height of the water surface at each spatial position and time, and the second network predicts the image color at each position. Using both networks, we reconstruct the observed sequence of images and can therefore use unsupervised training. We show that using implicit neural representations with periodic activation functions (SIREN) leads to effective modeling of the surface height spatio-temporal signal and its derivative, as required for image reconstruction. Using both simulated and real data we show that our method outperforms the latest unsupervised image restoration approach. In addition, it provides an estimate of the water surface.
154. 【2603.07604】EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation
链接:https://arxiv.org/abs/2603.07604
作者:Arpita Saggar,Jonathan C. Darling,Duygu Sarikaya,David C. Hogg
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, Real-time talking head, synthesis increasingly relies, Real-time talking, relies on deformable
备注: Preprint
点击查看摘要
Abstract:Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce EmbedTalk, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.
155. 【2603.07593】Fast Attention-Based Simplification of LiDAR Point Clouds for Object Detection and Classification
链接:https://arxiv.org/abs/2603.07593
作者:Z. Rozsa,Á. Madaras,Q. Wei,X. Lu,M. Golarits,H. Yuan,T. Sziranyi,R. Hamzaoui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:represent surrounding objects, traffic signs, autonomous driving, driving and consist, consist of large
备注:
点击查看摘要
Abstract:LiDAR point clouds are widely used in autonomous driving and consist of large numbers of 3D points captured at high frequency to represent surrounding objects such as vehicles, pedestrians, and traffic signs. While this dense data enables accurate perception, it also increases computational cost and power consumption, which can limit real-time deployment. Existing point cloud sampling methods typically face a trade-off: very fast approaches tend to reduce accuracy, while more accurate methods are computationally expensive. To address this limitation, we propose an efficient learned point cloud simplification method for LiDAR data. The method combines a feature embedding module with an attention-based sampling module to prioritize task-relevant regions and is trained end-to-end. We evaluate the method against farthest point sampling (FPS) and random sampling (RS) on 3D object detection on the KITTI dataset and on object classification across four datasets. The method was consistently faster than FPS and achieved similar, and in some settings better, accuracy, with the largest gains under aggressive downsampling. It was slower than RS, but it typically preserved accuracy more reliably at high sampling ratios.
156. 【2603.07590】Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
链接:https://arxiv.org/abs/2603.07590
作者:Chenxi Li,Xianggan Liu,Dake Shen,Yaosong Du,Zhibo Yao,Hao Jiang,Linyi Jiang,Chengwei Cao,Jingzhe Zhang,RanYi Peng,Peiling Bai,Xiande Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large Vision-Language Models, progress of Large, Large Vision-Language, visual modalities introduces, rapid progress
备注:
点击查看摘要
Abstract:Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.
157. 【2603.07587】3DGS-HPC: Distractor-free 3D Gaussian Splatting with Hybrid Patch-wise Classification
链接:https://arxiv.org/abs/2603.07587
作者:Jiahao Chen,Yipeng Qin,Ganlong Zhao,Xin Li,Wenping Wang,Guanbin Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable performance, real-world environments due, Gaussian Splatting, scene reconstruction, varying shadows
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and 3D scene reconstruction, yet its quality often degrades in real-world environments due to transient distractors, such as moving objects and varying shadows. Existing methods commonly rely on semantic cues extracted from pre-trained vision models to identify and suppress these distractors, but such semantics are misaligned with the binary distinction between static and transient regions and remain fragile under the appearance perturbations introduced during 3DGS optimization. We propose 3DGS-HPC, a framework that circumvents these limitations by combining two complementary principles: a patch-wise classification strategy that leverages local spatial consistency for robust region-level decisions, and a hybrid classification metric that adaptively integrates photometric and perceptual cues for more reliable separation. Extensive experiments demonstrate the superiority and robustness of our method in mitigating distractors to improve 3DGS-based novel view synthesis.
158. 【2603.07577】Integration of deep generative Anomaly Detection algorithm in high-speed industrial line
链接:https://arxiv.org/abs/2603.07577
作者:Niccolò Ferrari,Nicola Zanarini,Michele Fraccaroli,Alice Bizzarri,Evelina Lamma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:pharmaceutical production requires, requires high accuracy, Industrial visual inspection, hardware footprint, cycle time
备注: Preprint under review at a Springer Nature journal. 36 pages, 3 tables, 29 figures. Updated and expanded version of the SSRN preprint (abstract_id=4858664), with substantial revisions and Springer Nature formatting
点击查看摘要
Abstract:Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.
159. 【2603.07571】A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification
链接:https://arxiv.org/abs/2603.07571
作者:Furkan Genç,Onat Özdemir,Emre Akbaş
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Prototype Loss, safety-sensitive applications, Loss, critical in safety-sensitive, Cross-Entropy Loss
备注:
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is critical in safety-sensitive applications. While this challenge has been addressed from various perspectives, the influence of training objectives on OOD behavior remains comparatively underexplored. In this paper, we present a systematic comparison of four widely used training objectives: Cross-Entropy Loss, Prototype Loss, Triplet Loss, and Average Precision (AP) Loss, spanning probabilistic, prototype-based, metric-learning, and ranking-based supervision, for OOD detection in image classification under standardized OpenOOD protocols. Across CIFAR-10/100 and ImageNet-200, we find that Cross-Entropy Loss, Prototype Loss, and AP Loss achieve comparable in-distribution accuracy, while Cross-Entropy Loss provides the most consistent near- and far-OOD performance overall; the other objectives can be competitive in specific settings.
160. 【2603.07570】Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance
链接:https://arxiv.org/abs/2603.07570
作者:Guodong Sun,Junjie Liu,Gaoyang Zhang,Bo Wu,Yang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scene understanding plays, robotic systems, plays a critical, critical role, role in enabling
备注: 23 pages, 13 figures
点击查看摘要
Abstract:Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.
161. 【2603.07566】GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module
链接:https://arxiv.org/abs/2603.07566
作者:Niccolò Ferrari,Michele Fraccaroli,Evelina Lamma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Anomaly detection
备注: Peer-reviewed journal version published. 18 pages, 12 figures, 7 tables
点击查看摘要
Abstract:Anomaly detection is nowadays increasingly used in industrial applications and processes. One of the main fields of the appliance is the visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task, that usually is achieved using a basic comparison between generated image and the original one, implementing some blob-analysis or image-editing algorithms, in the post-processing step, which is very biased towards the source dataset, and they are unable to generalize. Furthermore, in industrial applications, the totality of the image is not always interesting but could be one or some regions of interest (ROIs), where only in those areas there are relevant anomalies to be spotted. For these reasons, we propose a new architecture composed by two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), to perform reconstruction and denoising processes, while the second block produces image segmentation, spotting defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using a ROI for each image contained in the training dataset. The network will learn in which area anomalies are relevant. This approach guarantees the reduction of using pre-processing algorithms, formerly developed with blob-analysis and image-editing procedures. To test our model we used challenging MVTec anomaly detection datasets and an industrial large dataset of pharmaceutical BFS strips of vials. This set constitutes a more realistic use case of the aforementioned network.
Journal reference: International Journal of Intelligent Systems, vol. 2023, Article ID 7773481, 2023
Related DOI: https://doi.org/10.1155/2023/7773481
162. 【2603.07564】SiamGM: Siamese Geometry-Aware and Motion-Guided Network for Real-Time Satellite Video Object Tracking
链接:https://arxiv.org/abs/2603.07564
作者:Zixiao Wen,Zhen Yang,Jiawei Li,Xiantai Xiang,Guangyao Zhou,Yuxin Hu,Yuhan Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:frequent visual occlusions, Single object tracking, Single object, large aspect ratio, visual occlusions
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Single object tracking in satellite videos is inherently challenged by small target, blurred background, large aspect ratio changes, and frequent visual occlusions. These constraints often cause appearance-based trackers to accumulate errors and lose targets irreversibly. To systematically mitigate both spatial ambiguities and temporal information loss, we propose SiamGM, a novel geometry-aware and motion-guided Siamese network. From a spatial perspective, we introduce an Inter-Frame Graph Attention (IFGA) module, closely integrated with an Aspect Ratio-Constrained Label Assignment (LA) method, establishing fine-grained topological correspondences and explicitly preventing surrounding background noise. From a temporal perspective, we introduce the Motion Vector-Guided Online Tracking Optimization method. By adopting the Normalized Peak-to-Sidelobe Ratio (nPSR) as a dynamic confidence indicator, we propose an Online Motion Model Refinement (OMMR) strategy to utilize historical trajectory information. Evaluations on two challenging SatSOT and SV248S benchmarks confirm that SiamGM outperforms most state-of-the-art trackers in both precision and success metrics. Notably, the proposed components of SiamGM introduce virtually no computational overhead, enabling real-time tracking at 130 frames per second (FPS). Codes and tracking results are available at this https URL.
163. 【2603.07562】Brain-WM: Brain Glioblastoma World Model
链接:https://arxiv.org/abs/2603.07562
作者:Chenhui Wang,Boyun Zheng,Liuxin Bao,Zhihao Peng,Peter Y.M. Woo,Hongming Shan,Yixuan Yuan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Precise prognostic modeling, Precise prognostic, varying treatment interventions, modeling of glioblastoma, future MRI generation
备注:
点击查看摘要
Abstract:Precise prognostic modeling of glioblastoma (GBM) under varying treatment interventions is essential for optimizing clinical outcomes. While generative AI has shown promise in simulating GBM evolution, existing methods typically treat interventions as static conditional inputs rather than dynamic decision variables. Consequently, they fail to capture the complex, reciprocal interplay between tumor evolution and treatment response. To bridge this gap, we present Brain-WM, a pioneering brain GBM world model that unifies next-step treatment prediction and future MRI generation, thereby capturing the co-evolutionary dynamics between tumor and treatment. Specifically, Brain-WM encodes spatiotemporal dynamics into a shared latent space for joint autoregressive treatment prediction and flow-based future MRI generation. Then, instead of a conventional monolithic framework, Brain-WM adopts a novel Y-shaped Mixture-of-Transformers (MoT) architecture. This design structurally disentangles heterogeneous objectives, successfully leveraging cross-task synergies while preventing feature collapse. Finally, a synergistic multi-timepoint mask alignment objective explicitly anchors latent representations to anatomically grounded tumor structures and progression-aware semantics. Extensive validation on internal and external multi-institutional cohorts demonstrates the superiority of Brain-WM, achieving 91.5% accuracy in treatment planning and SSIMs of 0.8524, 0.8581, and 0.8404 for FLAIR, T1CE, and T2W sequences, respectively. Ultimately, Brain-WM offers a robust clinical sandbox for optimizing patient healthcare. The source code is made available at this https URL.
164. 【2603.07561】PureCC: Pure Learning for Text-to-Image Concept Customization
链接:https://arxiv.org/abs/2603.07561
作者:Zhichao Liao,Xiaole Xian,Qingyu Li,Wenyu Qin,Meng Wang,Weicheng Xie,Siyang Song,Pingfa Feng,Long Zeng,Liang Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable outcomes, Existing concept customization, Existing concept, methods have achieved, achieved remarkable
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $\lambda^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at this https URL.
165. 【2603.07559】Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
链接:https://arxiv.org/abs/2603.07559
作者:Weijia Feng,Jingyu Yang,Ruojia Zhang,Fengtao Sun,Qian Gao,Chenyang Wang,Tongtong Su,Jia Guo,Xiaobai Li,Minglai Shao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:holding great potential, transient movements triggered, Expected Free Energy, emotional activities, holding great
备注: 10 pages, accepted by CVPR 2026
点击查看摘要
Abstract:Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human-computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference-based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.
166. 【2603.07552】ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction
Link: https://arxiv.org/abs/2603.07552
Authors: Haibao Yu, Kuntao Xiao, Jiahang Wang, Ruiyang Hao, Yuxin Huang, Guoran Hu, Haifang Qin, Bowen Jing, Yuntian Bo, Ping Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: realistic closed-loop evaluation, High-fidelity visual reconstruction, closed-loop evaluation, evaluation in autonomous, Hybrid Gaussian Prediction
Comments:
Abstract:High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.
167. 【2603.07545】DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
Link: https://arxiv.org/abs/2603.07545
Authors: Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Biwei Huang, Keze Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Learned world models, Learned world, interpolative generalization, extrapolative generalization, excel at interpolative
Comments: 19 pages, 5 figures
Abstract:Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce Symmetry Exploration, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, DreamSAC, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.
168. 【2603.07543】CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
Link: https://arxiv.org/abs/2603.07543
Authors: Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: remains challenging due, achieving impressive results, One-shot styled handwriting, styled handwriting image, single reference image
Comments: Accepted as oral presentation at WACV 2026
Abstract:One-shot styled handwriting image generation, despite achieving impressive results in recent years, remains challenging because the intricate and diverse characteristics of human handwriting are difficult to capture from a single reference image. Existing methods still struggle to generate visually appealing and realistic handwritten images and to adapt to complex, unseen writer styles, because they fail to isolate invariant style features (e.g., slant, stroke width, curvature) while ignoring irrelevant noise. To tackle this problem, we introduce Patch Contrastive Enhancement and Style-Aware Quantization via Denoising Diffusion (CONSTANT), a novel diffusion-based method for one-shot handwriting generation. CONSTANT leverages three key innovations: 1) a Style-Aware Quantization (SAQ) module that models style as discrete visual tokens capturing distinct concepts; 2) a contrastive objective to ensure these tokens are well-separated and meaningful in the style embedding space; 3) a latent patch-based contrastive ($\mathcal{L}_{\mathrm{LatentPCE}}$) objective that helps improve quality and local structure by aligning multiscale spatial patches of generated and real features in latent space. Extensive experiments and analysis on benchmark datasets in multiple languages, including English, Chinese, and our proposed ViHTGen dataset for Vietnamese, demonstrate that our method outperforms state-of-the-art approaches in adapting to new reference styles and producing highly detailed images. Code is available at GitHub.
169. 【2603.07540】How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
Link: https://arxiv.org/abs/2603.07540
Authors: Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: coherent long-form stories, Unified multimodal models, Unified multimodal, multimodal models hold, interleaved narratives
Comments:
Abstract:Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
170. 【2603.07535】Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach
Link: https://arxiv.org/abs/2603.07535
Authors: Yibin Ye, Shuo Chen, Kun Wang, Xiaokai Song, Jisheng Dang, Qifeng Yu, Xichao Teng, Zhang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Cross-View Geo-Localization, satellite images plays, UAV, plays a crucial, crucial role
Comments: 14 pages
Abstract:Cross-View Geo-Localization (CVGL) between UAV imagery and satellite images plays a crucial role in target localization and UAV self-positioning. However, most existing methods rely on the idealized assumption of scale consistency between UAV queries and satellite galleries, overlooking the severe scale ambiguity commonly encountered in real-world scenarios. This discrepancy leads to field-of-view misalignment and feature mismatch, significantly degrading CVGL robustness. To address this issue, we propose a geometric framework that recovers the absolute metric scale from monocular UAV images using semantic anchors. Specifically, small vehicles (SVs), characterized by relatively stable prior size distributions and high detectability, are exploited as metric references. A Decoupled Stereoscopic Projection Model is introduced to estimate the absolute image scale from these semantic targets. By decomposing vehicle dimensions into radial and tangential components, the model compensates for perspective distortions in 2D detections of 3D vehicles, enabling more accurate scale estimation. To further reduce intra-class size variation and detection noise, a dual-dimension fusion strategy with Interquartile Range (IQR)-based robust aggregation is employed. The estimated global scale is then used as a physical constraint for scale-adaptive satellite image cropping, improving UAV-to-satellite feature alignment. Experiments on augmented DenseUAV and UAV-VisLoc datasets demonstrate that the proposed method significantly improves CVGL robustness under unknown UAV image scales. Additionally, the framework shows strong potential for downstream applications such as passive UAV altitude estimation and 3D model scale recovery.
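The IQR-based robust aggregation mentioned in this abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact rule, so the conventional 1.5×IQR Tukey fences, the mean as the final aggregate, the function name `aggregate_scale`, and the sample values are all assumptions made for illustration, not the authors' implementation.

```python
import statistics

def aggregate_scale(estimates, k=1.5):
    """Aggregate per-vehicle scale estimates after IQR outlier rejection.

    Assumption: Tukey fences with factor k (1.5 by convention) and a plain
    mean of the inliers; the paper may use a different rule.
    """
    if len(estimates) < 4:
        # Too few samples for meaningful quartiles; fall back to the median.
        return statistics.median(estimates)
    q1, _, q3 = statistics.quantiles(estimates, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [e for e in estimates if low <= e <= high]
    return statistics.fmean(inliers)

# Hypothetical scale estimates (metres per pixel), one per detected vehicle;
# the last value stands in for a gross detection error.
estimates = [0.10, 0.11, 0.12, 0.10, 0.11, 0.50]
scale = aggregate_scale(estimates)
```

In this toy input the 0.50 estimate falls outside the fences and is discarded, so the aggregate is driven by the five consistent estimates.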
171. 【2603.07533】ACCURATE: Arbitrary-shaped Continuum Reconstruction Under Robust Adaptive Two-view Estimation
Link: https://arxiv.org/abs/2603.07533
Authors: Yaozhi Zhang, Shun Yu, Yugang Zhang, Yang Liu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: arbitrary-shaped long slender, slender continuum bodies, soft continuum manipulators, accurate mechanical simulation, long slender continuum
Comments:
Abstract:Accurate reconstruction of arbitrary-shaped long slender continuum bodies, such as guidewires, catheters and other soft continuum manipulators, is essential for accurate mechanical simulation. However, existing image-based reconstruction approaches often suffer from limited accuracy because they underutilize camera geometry, or lack generality because they rely on rigid geometric assumptions that may fail for continuum robots with complex and highly deformable shapes. To address these limitations, we propose ACCURATE, a 3D reconstruction framework integrating an image segmentation neural network with a geometry-constrained topology traversal and dynamic programming algorithm that enforces global biplanar geometric consistency, minimizes the cumulative point-to-epipolar-line distance, and remains robust to occlusions and to epipolar ambiguities caused by noise and discretization. Our method achieves high reconstruction accuracy on both simulated and real phantom datasets acquired using a clinical X-ray C-arm system, with mean absolute errors below 1.0 mm.
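The cumulative point-to-epipolar-line distance named in this abstract is a standard two-view quantity, and a minimal Python sketch of it follows. It does not show the paper's topology traversal or dynamic programming. The function names, the nested-list matrix layout, and the rectified example matrix are assumptions for illustration only.

```python
import math

def epipolar_distance(F, x1, x2):
    """Distance from point x2 (image 2) to the epipolar line F @ x1.

    F is a 3x3 fundamental matrix given as nested lists; x1 and x2 are
    (u, v) pixel coordinates in the first and second view.
    """
    p1 = (x1[0], x1[1], 1.0)
    # Epipolar line l = F p1, with l = (a, b, c) meaning a*u + b*v + c = 0.
    a, b, c = (sum(F[r][k] * p1[k] for k in range(3)) for r in range(3))
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)

def cumulative_cost(F, matches):
    """Sum of point-to-epipolar-line distances over candidate matches."""
    return sum(epipolar_distance(F, x1, x2) for x1, x2 in matches)

# Hypothetical rectified two-view setup: epipolar lines are horizontal,
# so the distance reduces to the difference in row coordinates.
F_rectified = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]
d = epipolar_distance(F_rectified, (10.0, 5.0), (20.0, 8.0))
total = cumulative_cost(F_rectified, [((10.0, 5.0), (20.0, 8.0)),
                                      ((1.0, 2.0), (3.0, 2.0))])
```

A matching procedure would search for the point correspondences along the two projected centerlines that keep `cumulative_cost` small.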
172. 【2603.07521】SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition
Link: https://arxiv.org/abs/2603.07521
Authors: Shilong Chen, Mingyuan Li, Zhaoyang Wang, Zhonglin Ye, Haixing Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: work investigates large-scale, graph-native perspective, stroke sequences, investigates large-scale sketch, large-scale sketch recognition
Comments:
Abstract:This work investigates large-scale sketch recognition from a graph-native perspective, where free-hand sketches are directly modeled as structured graphs rather than raster images or stroke sequences. We propose SketchGraphNet, a hybrid graph neural architecture that integrates local message passing with a memory-efficient global attention mechanism, without relying on auxiliary positional or structural encodings. To support systematic evaluation, we construct SketchGraph, a large-scale benchmark comprising 3.44 million graph-structured sketches across 344 categories, with two variants (A and R) to reflect different noise conditions. Each sketch is represented as a spatiotemporal graph with normalized stroke-order attributes. On SketchGraph-A and SketchGraph-R, SketchGraphNet achieves Top-1 accuracies of 83.62% and 87.61%, respectively, under a unified training configuration. MemEffAttn further reduces peak GPU memory by over 40% and training time by more than 30% compared with Performer-based global attention, while maintaining comparable accuracy.
173. 【2603.07515】EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification
Link: https://arxiv.org/abs/2603.07515
Authors: Binjia Zhou, Dawei Luo, Shuai Chen, Feng Xu, Seow, Haoyuan Li, Jiachi Wang, Jiawen Wang, Zunlei Feng, Yijun Bei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: security challenges posed, AIGC technology, advancement of AIGC, developing identification methods, rapid advancement
Comments:
Abstract:With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.
174. 【2603.07514】A Unified View of Drifting and Score-Based Models
Link: https://arxiv.org/abs/2603.07514
Authors: Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: train one-step generators, models train one-step, mean-shift discrepancy induced, Drifting models train, default in practice
Comments:
Abstract:Drifting models train one-step generators by optimizing a mean-shift discrepancy induced by a kernel between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, yielding a transport direction for generated samples. In this paper, we make its relationship to the score-matching principle behind diffusion models precise by showing that drifting admits a score-based formulation on kernel-smoothed distributions. For Gaussian kernels, the population mean-shift field coincides with the score difference between the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to the corresponding conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching-style objective on smoothed distributions. It also clarifies the connection to Distribution Matching Distillation (DMD): both methods use score-mismatch transport directions, but drifting realizes the score signal nonparametrically from kernel neighborhoods, whereas DMD uses a pretrained diffusion teacher. Beyond Gaussians, we derive an exact decomposition for general radial kernels, and for the Laplace kernel we prove rigorous error bounds showing that drifting remains an accurate proxy for score matching in low-temperature and high-dimensional regimes.
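The Gaussian-kernel identity described in this abstract can be written out as a short derivation. The notation below ($p$ for the data distribution, $q$ for the model distribution, bandwidth $\sigma$, and the symbol $V$ for the mean-shift field) is chosen here for illustration, and the paper's normalization may differ by a constant factor.

```latex
% Gaussian smoothing: x = x_0 + \sigma\varepsilon, \varepsilon \sim \mathcal{N}(0, I)
p_\sigma(x) = \int p(x_0)\, \mathcal{N}(x;\, x_0, \sigma^2 I)\, \mathrm{d}x_0 .

% Tweedie's formula: the score of the smoothed density is a rescaled
% displacement toward the posterior mean.
\nabla_x \log p_\sigma(x) = \frac{\mathbb{E}_{p}[x_0 \mid x] - x}{\sigma^2} .

% With the Gaussian kernel k_\sigma(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2)),
% the kernel-weighted mean of samples equals that posterior mean:
\frac{\mathbb{E}_{y \sim p}[\,k_\sigma(x, y)\, y\,]}{\mathbb{E}_{y \sim p}[\,k_\sigma(x, y)\,]}
  = \mathbb{E}_{p}[x_0 \mid x] .

% Hence the difference of the two mean-shift displacements is a score difference:
V(x) = \bigl(\mathbb{E}_{p}[x_0 \mid x] - x\bigr) - \bigl(\mathbb{E}_{q}[x_0 \mid x] - x\bigr)
     = \sigma^2 \bigl(\nabla_x \log p_\sigma(x) - \nabla_x \log q_\sigma(x)\bigr) .
```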
175. 【2603.07504】High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion
Link: https://arxiv.org/abs/2603.07504
Authors: Guoqing Zhang, Jingyun Yang, Siqi Chen, Anping Zhang, Yang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Anatomy shape modeling, Anatomy shape, fundamental problem, medical data analysis, Anatomy
Comments: 10 pages, 5 figures, journal
Abstract:Anatomy shape modeling is a fundamental problem in medical data analysis. However, the geometric complexity and topological variability of anatomical structures pose significant challenges to accurate anatomical shape generation. In this work, we propose a skeletal latent diffusion framework that explicitly incorporates structural priors for efficient and high-fidelity medical shape generation. We introduce a shape auto-encoder in which the encoder captures global geometric information through a differentiable skeletonization module and aggregates local surface features into shape latents, while the decoder predicts the corresponding implicit fields over sparsely sampled coordinates. New shapes are generated via a latent-space diffusion model, followed by neural implicit decoding and mesh extraction. To address the limited availability of medical shape data, we construct a large-scale dataset, MedSDF, comprising surface point clouds and corresponding signed distance fields across multiple anatomical categories. Extensive experiments on MedSDF and vessel datasets demonstrate that the proposed method achieves superior reconstruction and generation quality while maintaining higher computational efficiency compared with existing approaches. Code is available at: this https URL.
176. 【2603.07497】AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition
Link: https://arxiv.org/abs/2603.07497
Authors: Yuchuan Wu, Yinglian Zhu, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Ancient Chinese character, Chinese character recognition, cultural heritage digitization, Ancient Chinese, Continual Chinese Character
Comments:
Abstract:Ancient Chinese character recognition is a core capability for cultural heritage digitization, yet real-world workflows are inherently non-stationary: newly excavated materials are continuously onboarded, bringing new classes in different scripts, and expanding the class space over time. We formalize this process as Continual Chinese Character Recognition (Continual CCR), a script-staged, class-incremental setting that couples two challenges: (i) scalable learning under continual class growth with subtle inter-class differences and scarce incremental data, and (ii) pronounced intra-class diversity caused by writing-style variations across writers and carrier conditions. To overcome the limitations of conventional closed-set classification, we propose AMR-CCR, an anchored modular retrieval framework that performs recognition via embedding-based dictionary matching in a shared multimodal space, allowing new classes to be added by simply extending the dictionary. AMR-CCR further introduces a lightweight script-conditioned injection module (SIA+SAR) to calibrate newly onboarded scripts while preserving cross-stage embedding compatibility, and an image-derived multi-prototype dictionary that clusters within-class embeddings to better cover diverse style modes. To support systematic evaluation, we build EvoCON, a six-stage benchmark for continual script onboarding, covering six scripts (OBC, BI, SS, SAC, WSC, CS), augmented with meaning/shape descriptions and an explicit zero-shot split for unseen characters without image exemplars.
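The idea of recognition by dictionary matching with several prototypes per class, where a new class is added by extending the dictionary, can be sketched in a few lines of Python. This is only an illustration of the general retrieval pattern: the class name `PrototypeDictionary`, the cosine similarity measure, the toy 2-D embeddings, and the labels are all hypothetical, and the paper's encoder, script-conditioned module, and clustering step are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

class PrototypeDictionary:
    """Maps each class label to one or more prototype embeddings."""

    def __init__(self):
        self.prototypes = {}  # label -> list of prototype vectors

    def add_class(self, label, prototypes):
        # Adding a class only extends the dictionary; nothing is retrained.
        self.prototypes.setdefault(label, []).extend(prototypes)

    def recognize(self, query):
        # Score each class by its best-matching prototype (one per style mode).
        best_label, best_score = None, -math.inf
        for label, protos in self.prototypes.items():
            score = max(cosine(query, p) for p in protos)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

dictionary = PrototypeDictionary()
dictionary.add_class("char_A", [[1.0, 0.0], [0.9, 0.3]])  # two style modes
dictionary.add_class("char_B", [[0.0, 1.0]])
before = dictionary.recognize([0.6, 0.6])

# A later stage onboards a new class by extending the dictionary.
dictionary.add_class("char_C", [[0.7, 0.7]])
after = dictionary.recognize([0.6, 0.6])
```

The same query is assigned to `char_A` before the new class exists and to `char_C` once its prototype has been added, without touching the existing entries.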
177. 【2603.07494】DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding
Link: https://arxiv.org/abs/2603.07494
Authors: Yuchuan Wu, Minghan Zhuo, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, multimodal large language, current document MLLMs, language models, high-stakes scenarios
Comments:
Abstract:Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process: even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language CoT, to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four benchmarks.
178. 【2603.07493】RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection
Link: https://arxiv.org/abs/2603.07493
Authors: Rui Ding, Zhaonian Kuang, Zongwei Zhou, Meng Yang, Xinhu Zheng, Gang Hua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: bird eye view, predict accurate depth, detection with bird, eye view, driving and robotics
Comments:
Abstract:Multi-view 3D detection with bird's eye view (BEV) is crucial for autonomous driving and robotics, but its robustness in the real world is limited because it struggles to predict accurate depth values. A mainstream solution, cross-modal distillation, transfers depth information from LiDAR to camera models but also unintentionally transfers depth-irrelevant information (e.g. LiDAR density). To mitigate this issue, we propose RayD3D, which transfers crucial depth knowledge along the ray: a line projecting from the camera to the true location of an object. It is based on the fundamental imaging principle that the predicted location of this object can only vary along this ray, and is ultimately determined by the predicted depth value. Therefore, distilling along the ray enables more effective depth information transfer. More specifically, we design two ray-based distillation modules. Ray-based Contrastive Distillation (RCD) incorporates contrastive learning into distillation by sampling along the ray to learn how LiDAR accurately locates objects. Ray-based Weighted Distillation (RWD) adaptively adjusts the distillation weight based on the ray to minimize the interference of depth-irrelevant information in LiDAR. For validation, we apply RayD3D to three representative types of BEV-based models: BEVDet, BEVDepth4D, and BEVFormer. Our method is trained on clean NuScenes, and tested on both clean NuScenes and RoboBEV with a variety of data corruption types. Our method significantly improves the robustness of all three base models in all scenarios without increasing inference costs, and achieves the best results when compared to recently released multi-view and distillation models.
179. 【2603.07489】RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations
Link: https://arxiv.org/abs/2603.07489
Authors: Hao Wang, Yuanfan Li, Qi Zhou, Zhankuo Xu, Jiong Ni, Xin Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Snapshot Compressive Imaging, Deep learning algorithms, video Snapshot Compressive, Compressive Imaging, achieved great success
Comments:
Abstract:Deep learning algorithms for video Snapshot Compressive Imaging (SCI) have achieved great success, yet they predominantly focus on reconstructing from clean measurements. This overlooks a critical real-world challenge: the captured signal itself is often severely degraded by motion blur and low light. Consequently, existing models falter in practical applications. To break this limitation, we pioneer the first study on robust video SCI restoration, shifting the goal from "reconstruction" to "restoration"--recovering the underlying pristine scene from a degraded measurement. To facilitate this new task, we first construct a large-scale benchmark by simulating realistic, continuous degradations on the DAVIS 2017 dataset. Second, we propose RobustSCI, a network that enhances a strong encoder-decoder backbone with a novel RobustCFormer block. This block introduces two parallel branches--a multi-scale deblur branch and a frequency enhancement branch--to explicitly disentangle and remove degradations during the recovery process. Furthermore, we introduce RobustSCI-C (RobustSCI-Cascade), which integrates a pre-trained Lightweight Post-processing Deblurring Network to significantly boost restoration performance with minimal overhead. Extensive experiments demonstrate that our methods outperform all SOTA models on the new degraded testbeds, with additional validation on real-world degraded SCI data confirming their practical effectiveness, elevating SCI from merely reconstructing what is captured to restoring what truly happened.
180. 【2603.07486】Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection
Link: https://arxiv.org/abs/2603.07486
Authors: Rui Ding, Zhaonian Kuang, Yuzhe Ji, Meng Yang, Xinhu Zheng, Gang Hua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: bird eye view, achieved desired advances, data corruption, data
Comments:
Abstract:Multi-modal 3D object detection with bird's eye view (BEV) has achieved desired advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption such as sensor configurations for LiDAR and scene conditions for camera. One design bottleneck of previous models resides in the tight coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one modality or both is corrupted. To mitigate this, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways. These invariant features can be recovered across modalities for robust fusion under data corruption. To this end, we explicitly decouple Camera/LiDAR BEV features into modality-invariant and modality-specific parts. This allows invariant features to compensate for each other while mitigating the negative impact of a corrupted modality on the other. We then recouple these features into three experts to handle different types of data corruption, respectively, i.e., LiDAR, camera, and both. For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement. Finally, we adaptively fuse the three experts to extract robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.
181. 【2603.07476】EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Link: https://arxiv.org/abs/2603.07476
Authors: Wenqi Cai, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthesize compact training, compact training sets, significantly fewer samples, achieve high accuracy, aims to synthesize
Comments: CVPR2026 (main conference)
Abstract:Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision-Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Source code is available at this https URL.
182. 【2603.07468】FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation
Link: https://arxiv.org/abs/2603.07468
Authors: Xiaokang Zhang, Xuran Xiong, Jianzhong Huang, Lefei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Remote sensing image, sensing image segmentation, sharing raw imagery, Remote sensing, gained increasing attention
Comments: 14 pages, 8 figures
Abstract:Remote sensing image segmentation (RSIS) in federated environments has gained increasing attention because it enables collaborative model training across distributed datasets without sharing raw imagery or annotations. Federated RSIS combined with parameter-efficient fine-tuning (PEFT) can unleash the generalization power of pretrained foundation models for real-world applications, with minimal parameter aggregation and communication overhead. However, the dynamic adaptation of pretrained models to heterogeneous client data inevitably increases update uncertainty and compromises the reliability of collaborative optimization due to the lack of uncertainty estimation for each local model. To bridge this gap, we present FedEU, a federated optimization framework for fine-tuning RSIS models driven by evidential uncertainty. Specifically, personalized evidential uncertainty modeling is introduced to quantify epistemic variations of local models and identify high-risk areas under local data distributions. Furthermore, the client-specific feature embedding (CFE) is exploited to enhance channel-aware feature representation while preserving client-specific properties through personalized attention and an element-aware parameter update approach. These uncertainty estimates are uploaded to the server to enable adaptive global aggregation via a Top-k uncertainty-guided weighting (TUW) strategy, which mitigates the impact of distribution shifts and unreliable updates. Extensive experiments on three large-scale heterogeneous datasets demonstrate the superior performance of FedEU. More importantly, FedEU enables balanced model adaptation across diverse clients by explicitly reducing prediction uncertainty, resulting in more robust and reliable federated outcomes. The source code will be available at this https URL.
183. 【2603.07465】Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing
Link: https://arxiv.org/abs/2603.07465
Authors: Fanis Mathioulakis, Gorjan Radevski, Silke GC Cleuren, Michel Janssens, Brecht Das, Koen Schauwaert, Tinne Tuytelaars
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: industrial additive manufacturing, automating post-production workflows, Reliable classification, additive manufacturing, CAD models
Comments:
Click to view abstract
Abstract:Reliable classification of 3D-printed objects is essential for automating post-production workflows in industrial additive manufacturing. Despite extensive automation in other stages of the printing pipeline, this task still relies heavily on manual inspection, as the set of objects to be classified can change daily, making frequent model retraining impractical. Automating the identification step is therefore critical for improving operational efficiency. A vision model that could classify any set of objects by utilizing their corresponding CAD models and avoiding retraining would be highly beneficial in this setting. To enable systematic evaluation of vision models on this task, we introduce ThingiPrint, a new publicly available dataset that pairs CAD models with real photographs of their 3D-printed counterparts. Using ThingiPrint, we benchmark a range of existing vision models on the task of 3D-printed object classification. We additionally show that contrastive fine-tuning with a rotation-invariant objective allows effective prototype-based classification of previously unseen 3D-printed objects. By relying solely on the available CAD models, this avoids the need for retraining when new objects are introduced. Experiments show that this approach outperforms standard pretrained baselines, suggesting improved generalization and practical relevance for real-world use.
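The prototype-based classification mentioned in this abstract can be illustrated with a minimal sketch. It assumes each class prototype is the mean embedding of that object's CAD renders and that a photo is assigned to the prototype with the highest cosine similarity. The toy 3-dimensional embeddings, the class names, and the helper functions are hypothetical; the paper's actual encoder and contrastive objective are not reproduced here.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prototypes(cad_embeddings):
    """One prototype per class: the mean of that class's CAD-render embeddings."""
    return {label: mean_vector(vs) for label, vs in cad_embeddings.items()}

def classify(photo_embedding, prototypes):
    """Return the label whose prototype is most similar to the photo embedding."""
    return max(prototypes, key=lambda label: cosine(photo_embedding, prototypes[label]))

# Hypothetical embeddings of CAD renders (assumed to come from some frozen encoder).
cad_embeddings = {
    "bracket": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "gear": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]],
}
prototypes = build_prototypes(cad_embeddings)
prediction = classify([0.8, 0.3, 0.0], prototypes)

# A new object only needs a new prototype; the encoder is left untouched.
prototypes["knob"] = mean_vector([[0.1, 0.1, 1.0], [0.0, 0.2, 0.9]])
new_prediction = classify([0.1, 0.2, 0.8], prototypes)
```

In this sketch, adding a class amounts to computing one more mean vector, which is why no retraining is needed when the set of objects changes.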
184. 【2603.07464】Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection
Link: https://arxiv.org/abs/2603.07464
Authors: Rui Ding, Meng Yang, Nanning Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous vehicles due, accurate depth information, depth information, promising yet ill-posed, ill-posed task
Comments:
Click to view abstract
Abstract:Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation can effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate, for the first time, the negative transfer problem induced by the modality gap in cross-modality distillation, including not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating negative transfer on the image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.
185. 【2603.07463】SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing
Link: https://arxiv.org/abs/2603.07463
Authors: Xiaokang Zhang, Bo Li, Chufeng Zhou, Weikang Yu, Lefei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sensing image interpretation, fine-tuning have emerged, remote sensing image, remote sensing images, dynamic token masking
Comments: 17 pages, 10 figures
Click to view abstract
Abstract:Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among them, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source codes and model weights will be released at this https URL.
186. 【2603.07455】Image Generation Models: A Technical History
Link: https://arxiv.org/abs/2603.07455
Authors: Rouzbeh Shirvani
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Keywords: image generation models, breakthrough image generation, Image generation, past decade, application domains
Comments:
Click to view abstract
Abstract:Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.
187. 【2603.07454】SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition
Link: https://arxiv.org/abs/2603.07454
Authors: Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari, Mert D. Pesé
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: Geometric Modulation Unit, deep MLP based, Adaptive Point Embedding, MLP based models, cloud recognition designed
Comments: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Click to view abstract
Abstract:We present SLNet, a lightweight backbone for 3D point cloud recognition designed to achieve strong performance without the computational cost of many recent attention, graph, and deep MLP based models. The model is built on two simple ideas: NAPE (Nonparametric Adaptive Point Embedding), which captures spatial structure using a combination of Gaussian RBF and cosine bases with input adaptive bandwidth and blending, and GMU (Geometric Modulation Unit), a per channel affine modulator that adds only 2D learnable parameters. These components are used within a four stage hierarchical encoder with FPS+kNN grouping, nonparametric normalization, and shared residual MLPs. In experiments, SLNet shows that a very small model can still remain highly competitive across several 3D recognition tasks. On ModelNet40, SLNet-S with 0.14M parameters and 0.31 GFLOPs achieves 93.64% overall accuracy, outperforming PointMLP-elite with 5x fewer parameters, while SLNet-M with 0.55M parameters and 1.22 GFLOPs reaches 93.92%, exceeding PointMLP with 24x fewer parameters. On ScanObjectNN, SLNet-M achieves 84.25% overall accuracy within 1.2 percentage points of PointMLP while using 28x fewer parameters. For large scale scene segmentation, SLNet-T extends the backbone with local Point Transformer attention and reaches 58.2% mIoU on S3DIS Area 5 with only 2.5M parameters, more than 17x fewer than Point Transformer V3. We also introduce NetScore+, which extends NetScore by incorporating latency and peak memory so that efficiency can be evaluated in a more deployment oriented way. Across multiple benchmarks and hardware settings, SLNet delivers a strong overall balance between accuracy and efficiency. Code is available at: this https URL.
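To make the parameter count of a per-channel affine modulator concrete, here is a minimal sketch. It assumes that "2D learnable parameters" means 2 x D, that is, one scale and one shift per channel for D channels. The function name `gmu_modulate`, the identity initialization, and the toy feature values are hypothetical, and whatever geometric signal drives the modulation in SLNet itself is not modeled.

```python
def gmu_modulate(features, gamma, beta):
    """Apply y[c] = gamma[c] * x[c] + beta[c] to every point's feature vector."""
    return [[g * x + b for x, g, b in zip(point, gamma, beta)] for point in features]

D = 4                       # number of feature channels (toy value)
gamma = [1.0] * D           # per-channel scale, identity initialization (assumption)
beta = [0.0] * D            # per-channel shift, identity initialization (assumption)
num_params = len(gamma) + len(beta)  # 2 * D learnable scalars in total

# Two points with D-dimensional features (hypothetical values).
features = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.5, 3.0]]

identity_out = gmu_modulate(features, gamma, beta)
scaled_out = gmu_modulate(features, [2.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0])
```

The parameter count depends only on the channel width, not on the number of points, which is consistent with the very small model sizes reported in the abstract.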
188. 【2603.07443】Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.07443
Authors: Dunyuan Xu, Xikai Yang, Juzheng Miao, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Medical Multimodal Large, Multimodal Large, Large Language
Comments:
Click to view abstract
Abstract:Medical Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse healthcare tasks. However, current post-training strategies, such as supervised fine-tuning and reinforcement learning, heavily depend on substantial annotated data while overlooking the potential of unlabeled test data for model enhancement. This limitation becomes particularly pronounced in medical domains, where acquiring extensive labeled medical data is difficult due to the strict data sensitivity and annotation complexity. Moreover, leveraging test data poses challenges in generating reliable supervision signals from unlabeled samples and maintaining stable self-evolution. To address these limitations, we propose Med-Evo, the first self-evolution framework for medical MLLMs that utilizes label-free reinforcement learning to promote model performance without requiring additional labeled data. Our framework introduces two key innovations: (1) Feature-driven Pseudo Labeling (FPL) that identifies semantic centroids from all heterogeneous candidate responses to select pseudo labels in each rollout, and (2) Hard-Soft Reward (HSR) that combines exact match with token-level assessment and semantic similarity to provide hierarchical reward. Experiments on three medical VQA benchmarks and two base MLLMs show clear advantages of our approach over SOTA methods, with significant improvements of 10.43% accuracy and 4.68% recall on the SLAKE dataset using Qwen2.5-VL, showing the effectiveness of our method.
189. 【2603.07441】DogWeave: High-Fidelity 3D Canine Reconstruction from a Single Image via Normal Fusion and Conditional Inpainting
Link: https://arxiv.org/abs/2603.07441
Authors: Shufan Sun, Chenchen Wang, Zongfu Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex articulation, fine-scale details, Monocular, challenging due, due to complex
Comments:
Click to view abstract
Abstract:Monocular 3D animal reconstruction is challenging due to complex articulation, self-occlusion, and fine-scale details such as fur. Existing methods often produce distorted geometry and inconsistent textures due to the lack of articulated 3D supervision and limited availability of back-view images in 2D datasets, which makes reconstructing unobserved regions particularly difficult. To address these limitations, we propose DogWeave, a model-based framework for reconstructing high-fidelity 3D canine models from a single RGB image. DogWeave improves geometry by refining a coarsely-initiated parametric mesh into a detailed SDF representation through multi-view normal field optimization using diffusion-enhanced normals. It then generates view-consistent textures through conditional partial inpainting guided by structure and style cues, enabling realistic reconstruction of unobserved regions. Using only about 7,000 dog images processed via our 2D pipeline for training, DogWeave produces complete, realistic 3D models and outperforms state-of-the-art single-image-to-3D reconstruction methods in both shape accuracy and texture realism for canines.
190. 【2603.07436】RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation
Link: https://arxiv.org/abs/2603.07436
Authors: Weikun Lin, Yunhao Bai, Yan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Training-free one-shot segmentation, one-shot segmentation offers, Training-free one-shot, support images, one-shot segmentation
Comments: Under review at MICCAI 2026. 8 pages, 3 figures
Click to view abstract
Abstract:Training-free one-shot segmentation offers a scalable alternative to expert annotations where knowledge is often transferred from support images and foundation models. But existing methods often treat all pixels in support images and query response intensities models in a homogeneous way. They ignore the regional heterogeneity in support images and response heterogeneity in this http URL. To resolve this, we propose RPG-SAM, a framework that systematically tackles these heterogeneity gaps. Specifically, to address regional heterogeneity, we introduce Reliability-Weighted Prototype Mining (RWPM) to prioritize high-fidelity support features while utilizing background anchors as contrastive references for noise suppression. To address response heterogeneity, we develop Geometric Adaptive Selection (GAS) to dynamically recalibrate binarization thresholds by evaluating the morphological consensus of candidates. Finally, an iterative refinement loop method is designed to polish anatomical boundaries. By accounting for multi-layered information heterogeneity, RPG-SAM achieves a 5.56% mIoU improvement on the Kvasir dataset. Code will be released.
191. 【2603.07433】Data Agent: Learning to Select Data via End-to-End Dynamic Optimization
Link: https://arxiv.org/abs/2603.07433
Authors: Suorong Yang, Fangjian Su, Hai Gan, Ziqi Ye, Jie Li, Baile Xu, Furao Shen, Soujanya Poria
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dynamic Data selection, prioritizing informative samples, Dynamic Data, Data selection aims, Data selection
Comments:
Click to view abstract
Abstract:Dynamic data selection aims to accelerate training by prioritizing informative samples during online training. However, existing methods typically rely on task-specific handcrafted metrics or static/snapshot-based criteria to estimate sample importance, limiting scalability across learning paradigms and making it difficult to capture the evolving utility of data throughout training. To address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that formulates data selection as a training-aware sequential decision-making problem. The agent learns a sample-wise selection policy that co-evolves with model optimization, guided by a composite reward that integrates loss-based difficulty and confidence-based uncertainty signals. The reward signals capture complementary objectives of optimization impact and information gain, together with a tuning-free adaptive weighting mechanism that balances these signals over training. Extensive experiments across a wide range of datasets and architectures demonstrate that Data Agent consistently accelerates training while preserving or improving performance, e.g., reducing costs by over 50% on ImageNet-1k and MMLU with lossless performance. Moreover, its dataset-agnostic formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., robustness to noisy datasets, highlighting its potential in real-world scenarios.
192. 【2603.07432】Generalization in Online Reinforcement Learning for Mobile Agents
Link: https://arxiv.org/abs/2603.07432
Authors: Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: Graphical user interface, interpreting natural-language instructions, based mobile agents, mobile agents automate, agents automate digital
Comments:
Click to view abstract
Abstract:Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (this https URL).
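For readers unfamiliar with GRPO, the core idea is that each rollout's advantage is computed relative to the other rollouts sampled for the same task, without a learned value function. The following is a minimal sketch of that group-relative normalization only. The function name, the binary success rewards, and the use of the population standard deviation are assumptions made for illustration (implementations differ on this detail), and the paper's rollout system and policy update are not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four rollouts of the same task, reward 1.0 on success.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every rollout gets the same reward, the group carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so the policy is pushed toward actions that beat the group average for that task.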
193. 【2603.07430】Disentangled Textual Priors for Diffusion-based Image Super-Resolution
Link: https://arxiv.org/abs/2603.07430
Authors: Lei Jiang, Xin Liu, Xinze Tong, Zhiliang Li, Jie Liu, Jie Tang, Gangshan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reconstruct high-resolution images, degraded low-resolution inputs, Image Super-Resolution, high-resolution images, aims to reconstruct
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.
194. 【2603.07414】QdaVPR: A novel query-based domain-agnostic model for visual place recognition
Link: https://arxiv.org/abs/2603.07414
Authors: Shanshan Wan, Lai Kang, Yingmei Wei, Tianrui Shen, Haixuan Wang, Chao Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual place recognition, Visual place, VPR, visual features, place recognition
Comments:
Click to view abstract
Abstract:Visual place recognition (VPR) aiming at predicting the location of an image based solely on its visual features is a fundamental task in robotics and autonomous systems. Domain variation remains one of the main challenges in VPR and is relatively unexplored. Existing VPR models attempt to achieve domain agnosticism either by training on large-scale datasets that inherently contain some domain variations, or by being specifically adapted to particular target domains. In practice, the former lacks explicit domain supervision, while the latter generalizes poorly to unseen domain shifts. This paper proposes a novel query-based domain-agnostic VPR model called QdaVPR. First, a dual-level adversarial learning framework is designed to encourage domain invariance for both the query features forming the global descriptor and the image features from which these query features are derived. Then, a triplet supervision based on query combinations is designed to enhance the discriminative power of the global descriptors. To support the learning process, we augment a large-scale VPR dataset using style transfer methods, generating various synthetic domains with corresponding domain labels as auxiliary supervision. Extensive experiments show that QdaVPR achieves state-of-the-art performance on multiple VPR benchmarks with significant domain variations. Specifically, it attains the best Recall@1 and Recall@10 on nearly all test scenarios: 93.5%/98.6% on Nordland (seasonal changes), 97.5%/99.0% on Tokyo24/7 (day-night transitions), and the highest Recall@1 across almost all weather conditions on the SVOX dataset. Our code will be released at this https URL.
195. 【2603.07406】UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration
Link: https://arxiv.org/abs/2603.07406
Authors: Debabrata Mandal, Soumitri Chattopadhyay, Yujie Wang, Marc Niethammer, Praneeth Chakravarthula
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: recover clean images, arbitrary real-world degradations, aims to recover, recover clean, arbitrary real-world
Comments:
Click to view abstract
Abstract:Universal image restoration aims to recover clean images from arbitrary real-world degradations using a single inference model. Despite significant progress, existing all-in-one restoration networks do not scale to multiple degradations. As the number of degradations increases, training becomes unstable, models grow excessively large, and performance drops across both seen and unseen domains. In this work, we show that scaling universal restoration is fundamentally limited by interference across degradations during joint learning, leading to catastrophic task forgetting. To address this challenge, we introduce a unified inference pipeline with a multi-branch mixture-of-experts architecture that decomposes restoration knowledge across specialized task-adaptable experts. Our approach enables scalable learning (over sixteen degradations), adapts and generalizes robustly to unseen domains, and supports user-controllable restoration across degradations. Beyond achieving superior performance across benchmarks, this work establishes a new design paradigm for scalable and controllable universal image restoration.
196. 【2603.07403】Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models
Link: https://arxiv.org/abs/2603.07403
Authors: Anastasiia Sukhanova, Aiden Taylor, Julian Myers, Zichun Wang, Kartha Veerya Jammuladinne, Satya Sri Rajiteswari Nimmagadda, Aniruddha Maiti, Ananya Jana
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made significant advances, Digital dentistry, captions, dental image analysis, dental
Comments: Accepted to IEEE International Conference on Semantic Computing (IEEE ICSC 2026)
Click to view abstract
Abstract:Digital dentistry has made significant advances with the advent of deep learning. However, the majority of these deep learning-based dental image analysis models focus on very specific tasks such as tooth segmentation, tooth detection, cavity detection, and gingivitis classification. There is a lack of a specialized model that has holistic knowledge of teeth and can perform dental image analysis tasks based on that knowledge. Datasets of dental images with captions can help build such a model. To the best of our knowledge, existing dental image datasets with captions are few in number and limited in scope. In many of these datasets, the captions describe the entire mouth, while the images are limited to the anterior view. As a result, posterior teeth such as molars are not clearly visible, limiting the usefulness of the captions for training vision-language models. Additionally, the captions focus only on a specific disease (gingivitis) and do not provide a holistic assessment of each tooth. Moreover, tooth disease scores are typically assigned to individual teeth, and each tooth is treated as a separate entity in orthodontic procedures. Therefore, it is important to have captions for single-tooth images. As far as we know, no such dataset of single-tooth images with dental captions exists. In this work, we aim to bridge that gap by assessing the possibility of generating captions for dental images using Vision-Language Models (VLMs) and evaluating the extent and quality of those captions. Our findings suggest that guided prompts help VLMs generate meaningful captions. We show that the prompts generated by our framework are better anchored in describing the visual aspects of dental images. We selected RGB images as they have greater potential in consumer scenarios.
197. 【2603.07401】VIVECaption: A Split Approach to Caption Quality Improvement
Link: https://arxiv.org/abs/2603.07401
Authors: Varun Ananth, Baqiao Liu, Haoran Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Caption quality, critical bottleneck, caption quality improvement, Caption, quality
Comments:
Click to view abstract
Abstract:Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a gold-standard dataset creation methodology using stratified sampling and (2) a model alignment strategy encompassing context alignment and parameter-level finetuning using SFT. We demonstrate our methodology on open-source models, focusing on structured caption formats that enable better parsing and downstream utilization. We ultimately show that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality. Our work addresses the growing need for high-quality "vegan" training data in enterprise AI development, providing practical solutions for teams seeking to improve caption-image alignment without relying on potentially copyright-protected web-scraped content.
198. 【2603.07399】Interpretable Aneurysm Classification via 3D Concept Bottleneck Models: Integrating Morphological and Hemodynamic Clinical Features
Link: https://arxiv.org/abs/2603.07399
Authors: Toqa Khaled, Ahmad Al-Kabbany
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: compromising clinical transparency, assessing intracranial aneurysms, challenge of reliably, reliably classifying, classifying and assessing
Comments:
Click to view abstract
Abstract:We are concerned with the challenge of reliably classifying and assessing intracranial aneurysms using deep learning without compromising clinical transparency. While traditional black-box models achieve high predictive accuracy, their lack of inherent interpretability remains a significant barrier to clinical adoption and regulatory approval. Explainability is paramount in medical modeling to ensure that AI-driven diagnoses align with established neurosurgical principles. Unlike traditional eXplainable AI (XAI) methods -- such as saliency maps, which often provide post-hoc, non-causal visual correlations -- Concept Bottleneck Models (CBMs) offer a robust alternative by constraining the model's internal logic to human-understandable clinical indices. In this article, we propose an end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification. We implemented this pipeline using a pre-trained 3D ResNet-34 backbone and a 3D DenseNet-121 to extract features from CTA volumes, which were subsequently processed through a soft bottleneck layer representing human-interpretable clinical concepts. The model was optimized using a joint-loss function to balance diagnostic focal loss and concept mean squared error (MSE), validated via stratified five-fold cross-validation. Our results demonstrate a peak task classification accuracy of 93.33% +/- 4.5% for the ResNet-34 architecture and 91.43% +/- 5.8% for the DenseNet-121 model. Furthermore, the implementation of 8-pass Test-Time Augmentation (TTA) yielded a robust mean accuracy of 88.31%, ensuring diagnostic stability during inference. By maintaining an accuracy-generalization gap of less than 0.04, this framework proves that high predictive performance can be achieved without sacrificing interpretability.
199. 【2603.07394】AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Link: https://arxiv.org/abs/2603.07394
Authors: Jihyoung Jang, Hyounghun Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Visual Question Answering, Visual Question, Ambiguous Visual Question, Question Answering, core task
Comments: ICLR 2026 (28 pages); Project website: [this https URL](https://aqua-iclr2026.github.io/)
Click to view abstract
Abstract:Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
200. 【2603.07361】N-Tree Diffusion for Long-Horizon Wildfire Risk Forecasting
Link: https://arxiv.org/abs/2603.07361
Authors: Yucheng Xing,Xin Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: sparse event supervision, maintaining computational efficiency, forecasting requires generating, requires generating probabilistic, multiple prediction horizons
Comments: 15 pages, 6 figures
Click to view abstract
Abstract:Long-horizon wildfire risk forecasting requires generating probabilistic spatial fields under sparse event supervision while maintaining computational efficiency across multiple prediction horizons. Extending diffusion models to multi-step forecasting typically repeats the denoising process independently for each horizon, leading to redundant computation. We introduce N-Tree Diffusion (NT-Diffusion), a hierarchical diffusion model designed for long-horizon wildfire risk forecasting. Fire occurrences are represented as continuous Fire Risk Maps (FRMs), which provide a smoothed spatial risk field suitable for probabilistic modeling. Instead of running separate diffusion trajectories for each predicted timestamp, NT-Diffusion shares early denoising stages and branches at later levels, allowing horizon-specific refinement while reducing redundant sampling. We evaluate the proposed framework on a newly collected real-world wildfire dataset constructed for long-horizon probabilistic prediction. Results indicate that NT-Diffusion achieves consistent accuracy improvements and reduced inference cost compared to baseline forecasting approaches.
201. 【2603.07356】AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision
Link: https://arxiv.org/abs/2603.07356
Authors: Mohammed Brahimi,Karim Laabassi,Mohamed Seghir Hadj Ameur,Aicha Boutorh,Badia Siab-Farsi,Amin Khouani,Omar Farouk Zouak,Seif Eddine Bouziane,Kheira Lakhdari,Abdelkader Nabil Benghanem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Machine learning, deployment environments, Machine learning models, fail to generalize, generalize under real
Comments: 17 pages, 8 figures, 6 tables. Introduces the AgrI Challenge dataset containing 50,673 field images of six tree species collected by twelve independent teams
Click to view abstract
Abstract:Machine learning models in agricultural vision often achieve high accuracy on curated datasets but fail to generalize under real field conditions due to distribution shifts between training and deployment environments. Moreover, most machine learning competitions focus primarily on model design while treating datasets as fixed resources, leaving the role of data collection practices in model generalization largely unexplored. We introduce the AgrI Challenge, a data-centric competition framework in which multiple teams independently collect field datasets, producing a heterogeneous multi-source benchmark that reflects realistic variability in acquisition conditions. To systematically evaluate cross-domain generalization across independently collected datasets, we propose Cross-Team Validation (CTV), an evaluation paradigm that treats each team's dataset as a distinct domain. CTV includes two complementary protocols: Train-on-One-Team-Only (TOTO), which measures single-source generalization, and Leave-One-Team-Out (LOTO), which evaluates collaborative multi-source training. Experiments reveal substantial generalization gaps under single-source training: models achieve near-perfect validation accuracy yet exhibit validation-test gaps of up to 16.20% (DenseNet121) and 11.37% (Swin Transformer) when evaluated on datasets collected by other teams. In contrast, collaborative multi-source training dramatically improves robustness, reducing the gap to 2.82% and 1.78%, respectively. The challenge also produced a publicly available dataset of 50,673 field images of six tree species collected by twelve independent teams, providing a diverse benchmark for studying domain shift and data-centric learning in agricultural vision.
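The two Cross-Team Validation protocols named in this abstract can be sketched as split generators. This is a minimal illustration, assuming each team's dataset is identified by a label; the function names are hypothetical, and whether TOTO tests on each other team separately or on their union is not stated in the abstract (the sketch simply lists the other teams).

```python
def toto_splits(teams):
    """Train-on-One-Team-Only: train on a single team, evaluate on all the others."""
    return [(train, [t for t in teams if t != train]) for train in teams]


def loto_splits(teams):
    """Leave-One-Team-Out: train on every team except one, evaluate on the held-out team."""
    return [([t for t in teams if t != held_out], held_out) for held_out in teams]


if __name__ == "__main__":
    teams = ["team_01", "team_02", "team_03"]
    for train, test in toto_splits(teams):
        print("TOTO  train:", train, " test:", test)
    for train, test in loto_splits(teams):
        print("LOTO  train:", train, " test:", test)
```

Both protocols produce one split per team, so twelve teams would give twelve TOTO runs and twelve LOTO runs per model.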
Submission history: [v1] Sat, 7 Mar 2026 21:40:34 UTC (642 KB), from Mohammed Brahimi
202. 【2603.07338】A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction
Link: https://arxiv.org/abs/2603.07338
Authors: Murat Arda Onsu,Poonam Lohan,Burak Kantarci,Aisha Syed,Matthew Andrews,Sean Kennedy
Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Robotics (cs.RO); Signal Processing (eess.SP)
Keywords: Intelligent Transportation Systems, Transportation Systems, Intelligent Transportation, management in Intelligent, motion estimation
Comments: 6 pages, 2 figures, IEEE ICC 2026 Workshops (under submission)
Click to view abstract
Abstract:Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.
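The collision test described in this abstract (spatial proximity plus temporal overlap of predicted trajectories) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: paths are ordered lists of 2D points, speed is expressed as an integer number of path indices per time step, and `radius` is a hypothetical safety distance. The K-D tree association step is omitted.

```python
import math


def predict_positions(path, index, speed, steps):
    """Advance along an ordered path by `speed` indices per step, clamped at the path end."""
    last = len(path) - 1
    return [path[min(index + speed * k, last)] for k in range(1, steps + 1)]


def first_conflict(traj_a, traj_b, radius):
    """Return the first future step at which both vehicles are within `radius`, else None."""
    for step, ((xa, ya), (xb, yb)) in enumerate(zip(traj_a, traj_b), start=1):
        if math.hypot(xa - xb, ya - yb) <= radius:
            return step  # same time step and close in space
    return None


if __name__ == "__main__":
    path_a = [(i, 0) for i in range(10)]      # vehicle A drives along the x-axis
    path_b = [(5, j - 5) for j in range(10)]  # vehicle B crosses it at (5, 0)
    a = predict_positions(path_a, 0, 1, 8)
    b = predict_positions(path_b, 0, 1, 8)
    print(first_conflict(a, b, radius=1.0))
```

Comparing positions only at matching time steps is what enforces the temporal-overlap condition: two paths that cross in space but at different times produce no conflict.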
203. 【2603.07314】Faster-HEAL: An Efficient and Privacy-Preserving Collaborative Perception Framework for Heterogeneous Autonomous Vehicles
Link: https://arxiv.org/abs/2603.07314
Authors: Armin Maleki,Hayder Radha
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: improving situational awareness, Collaborative perception, promising paradigm, paradigm for improving, improving situational
Comments: Accepted to appear in the 2026 IEEE Intelligent Vehicles Symposium (IV 2026), Detroit, MI, USA, June 22-25, 2026. 6 pages, 1 figure, 4 tables
Click to view abstract
Abstract:Collaborative perception (CP) is a promising paradigm for improving situational awareness in autonomous vehicles by overcoming the limitations of single-agent perception. However, most existing approaches assume homogeneous agents, which restricts their applicability in real-world scenarios where vehicles use diverse sensors and perception models. This heterogeneity introduces a feature domain gap that degrades detection performance. Prior works address this issue by retraining entire models/major components, or using feature interpreters for each new agent type, which is computationally expensive, compromises privacy, and may reduce single-agent accuracy. We propose Faster-HEAL, a lightweight and privacy-preserving CP framework that fine-tunes a low-rank visual prompt to align heterogeneous features with a unified feature space while leveraging pyramid fusion for robust feature aggregation. This approach reduces the trainable parameters by 94%, enabling efficient adaptation to new agents without retraining large models. Experiments on the OPV2V-H dataset show that Faster-HEAL improves detection performance by 2% over state-of-the-art methods with significantly lower computational overhead, offering a practical solution for scalable heterogeneous CP.
204. 【2603.07307】StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
Link: https://arxiv.org/abs/2603.07307
Authors: Duy M. H. Nguyen,Tuan A. Tran,Duong Nguyen,Siwei Xie,Trung Q. Nguyen,Mai T. N. Truong,Daniel Palenicek,An T. Le,Michael Barz,TrungTin Nguyen,Tuan Dam,Ngan Le,Minh Vu,Khoa Doan,Vien Ngo,Pengtao Xie,James Zou,Daniel Sonntag,Jan Peters,Mathias Niepert
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Vision Transformers, techniques for Vision, provide substantial speedups, Recent token merging, processed by self-attention
Comments: First version
Click to view abstract
Abstract:Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
205. 【2603.07302】Training for Trustworthy Saliency Maps: Adversarial Training Meets Feature-Map Smoothing
Link: https://arxiv.org/abs/2603.07302
Authors: Dipkamal Bhusal,Md Tanvirul Alam,Nidhi Rastogi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vanilla Gradient, Integrated Gradients, explain image classifiers, Gradient-based saliency methods, Gradient-based saliency
Comments:
Click to view abstract
Abstract:Gradient-based saliency methods such as Vanilla Gradient (VG) and Integrated Gradients (IG) are widely used to explain image classifiers, yet the resulting maps are often noisy and unstable, limiting their usefulness in high-stakes settings. Most prior work improves explanations by modifying the attribution algorithm, leaving open how the training procedure shapes explanation quality. We take a training-centered view and first provide a curvature-based analysis linking attribution stability to how smoothly the input-gradient field varies locally. Guided by this connection, we study adversarial training and identify a consistent trade-off: it yields sparser and more input-stable saliency maps, but can degrade output-side stability, causing explanations to change even when predictions remain unchanged and logits vary only slightly. To mitigate this, we propose augmenting adversarial training with a lightweight feature-map smoothing block that applies a differentiable Gaussian filter in an intermediate layer. Across FMNIST, CIFAR-10, and ImageNette, our method preserves the sparsity benefits of adversarial training while improving both input-side stability and output-side stability. A human study with 65 participants further shows that smoothed adversarial saliency maps are perceived as more sufficient and trustworthy. Overall, our results demonstrate that explanation quality is critically shaped by training, and that simple smoothing with robust training provides a practical path toward saliency maps that are both sparse and stable.
206. 【2603.07294】MAviS: A Multimodal Conversational Assistant For Avian Species
Link: https://arxiv.org/abs/2603.07294
Authors: Yevheniia Kryklyvets,Mohammed Irfan Kurpath,Sahal Shaji Mullappilly,Jinxing Zhou,Fahad Shabzan Khan,Rao Anwer,Salman Khan,Hisham Cholakkal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: advancing biodiversity conservation, multimodal question answering, vital for advancing, advancing biodiversity, biodiversity conservation
Comments: EMNLP 2025
Click to view abstract
Abstract:Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.
207. 【2603.07291】Virtual Try-On for Cultural Clothing: A Benchmarking Study
Link: https://arxiv.org/abs/2603.07291
Authors: Muhammad Tausif Ul Islam,Shahir Awlad,Sameen Yeaser Adib,Md. Atiqur Rahman,Sabbir Ahmed,Md. Hasanul Kabir
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse clothing styles, culturally diverse clothing, made significant progress, generalize culturally diverse, virtual try-on systems
Comments: 8 pages, 4 figures
Click to view abstract
Abstract:Although existing virtual try-on systems have made significant progress with the advent of diffusion models, the current benchmarks of these models are based on datasets dominated by Western-style clothing and female models, limiting their ability to generalize to culturally diverse clothing styles. In this work, we introduce BD-VITON, a virtual try-on dataset focused on Bangladeshi garments, including saree, panjabi and salwar kameez, covering both male and female categories. These garments present unique structural challenges such as complex draping, asymmetric layering, and high deformation complexity, which are underrepresented in the original VITON dataset. To establish strong baselines, we retrain and evaluate try-on models, namely StableViton, HR-VITON, and VITON-HD, on our dataset. Our experiments demonstrate consistent improvements in both quantitative and qualitative analysis compared to zero-shot inference.
208. 【2603.07276】Variational Flow Maps: Make Some Noise for One-Step Conditional Generation
Link: https://arxiv.org/abs/2603.07276
Authors: Abbas Mammadov,So Takao,Bohan Chen,Ricardo Baptista,Morteza Mardani,Yee Whye Teh,Julius Berner
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: enable high-quality image, high-quality image generation, maps enable high-quality, single forward pass, Flow maps enable
Comments:
Click to view abstract
Abstract:Flow maps enable high-quality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path", to that of "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at this https URL
209. 【2603.07246】LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture
Link: https://arxiv.org/abs/2603.07246
Authors: Erik Scheurer,Rocco Sedona,Stefan Kesselheim,Gabriele Cavallaro
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Geospatial foundation models, remote sensing data, foundation models provide, large-scale satellite remote, satellite remote sensing
Comments:
Click to view abstract
Abstract:Geospatial foundation models provide precomputed embeddings that serve as compact feature vectors for large-scale satellite remote sensing data. While these embeddings can reduce data-transfer bottlenecks and computational costs, Earth observation (EO) applications can still face geometric mismatches between user-defined areas of interest and the fixed precomputed embedding grid. Standard latent-space interpolation is unreliable in this setting because the embedding manifold is highly non-convex, yielding representations that do not correspond to realistic inputs. We verify this using Prithvi-EO-2.0 to understand the shortcomings of interpolation applied to patch embeddings. As a substitute, we propose a Learned Equivariance-Predicting Architecture (LEPA). Instead of averaging vectors, LEPA conditions a predictor on geometric augmentations to directly predict the transformed embedding. We evaluate LEPA on NASA/USGS Harmonized Landsat-Sentinel (HLS) imagery and ImageNet-1k. Experiments show that standard interpolation achieves a mean reciprocal rank (MRR) below 0.2, whereas LEPA increases MRR to over 0.8, enabling accurate geometric adjustment without re-encoding.
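The mean reciprocal rank (MRR) figures quoted in this abstract come from a standard retrieval metric, which the sketch below computes. The retrieval setup is an assumption for illustration: a predicted embedding is compared with a pool of candidate embeddings by squared Euclidean distance, and the rank of the true candidate is recorded. The function names and the distance choice are hypothetical and not taken from the paper.

```python
def rank_of_target(query, candidates, target_index):
    """1-based rank of the target candidate when candidates are sorted by distance to the query."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    target_d = sq_dist(query, candidates[target_index])
    closer = sum(1 for c in candidates if sq_dist(query, c) < target_d)
    return closer + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; 1.0 means the target is always retrieved first."""
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    candidates = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    print(rank_of_target((0.0, 0.0), candidates, target_index=2))
    print(mean_reciprocal_rank([1, 2, 4]))
```

An MRR below 0.2 means the correct embedding sits, on average, around fifth place or worse, while an MRR above 0.8 means it is retrieved first for most queries.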
210. 【2603.07244】PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation
Link: https://arxiv.org/abs/2603.07244
Authors: Xin-Sheng Chen,Jiayu Zhu,Pei-lin Li,Hanzheng Wang,Shuojin Yang,Meng-Hao Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Nano Banana Pro, medium for conveying, conveying information, information in presentation-oriented, presentation-oriented scenarios
Comments: 27 pages, 9 figures
Click to view abstract
Abstract:Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
211. 【2603.07240】FabricGen: Microstructure-Aware Woven Fabric Generation
Link: https://arxiv.org/abs/2603.07240
Authors: Yingjie Tang,Di Luo,Zixiong Wang,Xiaoli Ling,jian Yang,Beibei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: involves multiple stages, typically involves multiple, Woven fabric materials, multiple stages, designing realistic
Comments: 10 pages, 11 figures
Click to view abstract
Abstract:Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.
212. 【2603.07236】HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing
Link: https://arxiv.org/abs/2603.07236
Authors: Tencent HY Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: long time horizons, deployed systems expected, Foundation models, time horizons, transitioning from offline
Comments:
Click to view abstract
Abstract:Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.
213. 【2603.07234】Single Image Super-Resolution via Bivariate À Trous Wavelet Diffusion
Link: https://arxiv.org/abs/2603.07234
Authors: Heidari Maryam,Anantrasirichai Nantheera,Achim Alin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high frequency, recover high frequency, high frequency structure, high frequency details, effectiveness of super
Comments: 17 pages
Click to view abstract
Abstract:The effectiveness of super resolution (SR) models hinges on their ability to recover high frequency structure without introducing artifacts. Diffusion based approaches have recently advanced the state of the art in SR. However, most diffusion based SR pipelines operate purely in the spatial domain, which may yield high frequency details that are not well supported by the underlying low resolution evidence. On the other hand, unlike supervised SR models that may inject dataset specific textures, single image SR relies primarily on internal image statistics and can therefore be less prone to dataset-driven hallucinations; nevertheless, ambiguity in the LR observation can still lead to inconsistent high frequency details. To tackle this problem, we introduce BATDiff, an unsupervised Bivariate À Trous Wavelet Diffusion model designed to provide structured cross scale guidance during the generative process. BATDiff employs an à trous wavelet transform that constructs an undecimated multiscale representation in which high frequency components are progressively revealed while the full spatial resolution is preserved. As the core inference mechanism, BATDiff includes a bivariate cross scale module that models parent child dependencies between adjacent scales. It improves high frequency coherence and reduces mismatch artifacts in diffusion based SR. Experiments on standard benchmarks demonstrate that BATDiff produces sharper and more structurally consistent reconstructions than existing diffusion and non diffusion baselines, achieving improvements in fidelity and perceptual quality.
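The undecimated multiscale representation mentioned in this abstract can be illustrated with a one-dimensional à trous ("with holes") wavelet transform. This is a generic textbook sketch, not BATDiff itself: the B3-spline kernel and the clamped boundary handling are assumptions, the function name is hypothetical, and the paper's bivariate cross-scale module is not modelled here.

```python
def a_trous_1d(signal, levels):
    """Undecimated a trous decomposition: returns (detail planes, final smooth plane).

    Every plane keeps the full length of the input. At level j the 5-tap kernel
    is applied with a spacing of 2**j samples between taps (the "holes").
    """
    kernel = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # B3-spline low-pass (assumed choice)
    n = len(signal)
    smooth = [float(v) for v in signal]
    details = []
    for j in range(levels):
        step = 2 ** j
        nxt = []
        for i in range(n):
            acc = 0.0
            for k, w in zip(range(-2, 3), kernel):
                idx = min(max(i + k * step, 0), n - 1)  # clamp at the borders
                acc += w * smooth[idx]
            nxt.append(acc)
        # The detail plane is what the smoothing removed at this scale.
        details.append([s - t for s, t in zip(smooth, nxt)])
        smooth = nxt
    return details, smooth


if __name__ == "__main__":
    details, smooth = a_trous_1d([0, 1, 4, 9, 16, 25, 36, 49], levels=3)
    recon = [smooth[i] + sum(d[i] for d in details) for i in range(len(smooth))]
    print(recon)
```

Because each detail plane is the difference between consecutive smooth planes, summing all detail planes with the final smooth plane recovers the input, which is why no resolution is lost across scales.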
214. 【2603.07228】LightMedSeg: Lightweight 3D Medical Image Segmentation with Learned Spatial Anchors
Link: https://arxiv.org/abs/2603.07228
Authors: Kavyansh Tyagi,Vishwas Rathi,Puneet Goyal
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: data availability constraints, Accurate and efficient, stringent memory, availability constraints, essential for clinical
Comments: 8 pages, X figures. Submitted to CVPRW ECV 2026
Click to view abstract
Abstract:Accurate and efficient 3D medical image segmentation is essential for clinical AI, where models must remain reliable under stringent memory, latency, and data availability constraints. Transformer-based methods achieve strong accuracy but suffer from excessive parameters, high FLOPs, and limited generalization. We propose LightMedSeg, a modular UNet-style segmentation architecture that integrates anatomical priors with adaptive context modeling. Anchor-conditioned FiLM modulation enables anatomy-aware feature calibration, while a local structural prior module and texture-aware routing dynamically allocate representational capacity to boundary-rich regions. Computational redundancy is minimized through ghost and depthwise convolutions, and multi-scale features are adaptively fused via a learned skip router with anchor-relative spatial position bias. Despite requiring only 0.48M parameters and 14.64 GFLOPs, LightMedSeg achieves segmentation accuracy within a few Dice points of heavy transformer baselines. Therefore, LightMedSeg is a deployable and data-efficient solution for 3D medical image segmentation. Code will be released publicly upon acceptance.
215. 【2603.07222】VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization
Link: https://arxiv.org/abs/2603.07222
Authors: Seul-Ki Yeom,Marcel Simon,Eunbin Lee,Tae-Ho Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: made rapid progress, contextual shortcuts-background textures, Self-supervised learning, rapid progress, made rapid
Comments: 18 pages, 2 Tables, 3 Figures
Click to view abstract
Abstract:Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts such as background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views, not as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
</p>
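The VINO abstract reports its object-discovery result as CorLoc. As a reading aid, here is a minimal sketch of CorLoc as it is commonly defined: the percentage of images whose single predicted box overlaps at least one ground-truth box with IoU of 0.5 or more. The abstract does not state the paper's exact protocol, so the 0.5 threshold, the `(x1, y1, x2, y2)` box format, and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box hits any ground-truth box.

    predictions[i] is one box for image i; ground_truths[i] is a list of boxes.
    The 0.5 threshold is the conventional choice, assumed here.
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)
```

With this definition, a score of 34.8 would mean that roughly one image in three has its main object localized correctly.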
216. 【2603.07195】Shaping Parameter Contribution Patterns for Out-of-Distribution Detection
Link: https://arxiv.org/abs/2603.07195
Authors: Haonan Xu, Yang Yang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: well-known challenge due, parameter contribution patterns, contribution patterns, well-known challenge, challenge due
Comments:
Abstract:Out-of-distribution (OOD) detection is a well-known challenge because deep models often produce overconfident predictions. In this paper, we reveal a key insight that trained classifiers tend to rely on sparse parameter contribution patterns, meaning that only a few dominant parameters drive predictions. This brittleness can be exploited by OOD inputs that anomalously trigger these parameters, resulting in overconfident predictions. To address this issue, we propose a simple yet effective method called Shaping Parameter Contribution Patterns (SPCP), which enhances OOD detection robustness by encouraging the classifier to learn boundary-oriented dense contribution patterns. Specifically, SPCP operates during training by rectifying excessively high parameter contributions based on a dynamically estimated threshold. This mechanism encourages the classifier to rely on a broader set of parameters for decision-making, thereby reducing the risk of overconfident predictions caused by anomalously triggered parameters, while preserving in-distribution (ID) performance. Extensive experiments under various OOD detection setups verify the effectiveness of SPCP.
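To make "rectifying excessively high parameter contributions" concrete, the sketch below treats the contribution of each weight in a linear classifier as the product `w_j * x_j` and clips contributions above a threshold before summing them into a logit. This only illustrates the idea in the abstract and is not the SPCP implementation: the contribution definition, the quantile-based threshold, and both function names are assumptions.

```python
def rectified_logit(weights, features, threshold):
    """Sum per-parameter contributions w_j * x_j, clipping each at `threshold`.

    Clipping keeps a single dominant parameter from driving the logit.
    """
    contributions = [w * x for w, x in zip(weights, features)]
    return sum(min(c, threshold) for c in contributions)


def quantile_threshold(contributions, q=0.9):
    """Hypothetical threshold rule: the q-th empirical quantile.

    The abstract only says the threshold is "dynamically estimated";
    using a quantile of observed contributions is an assumption.
    """
    ordered = sorted(contributions)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]
```

For weights `[1, 1, 1]` and features `[0.5, 0.5, 10]`, the unclipped logit is 11.0, almost all of it from one parameter. With a threshold of 1.0 the rectified logit is 2.0.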
217. 【2603.07192】FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis
Link: https://arxiv.org/abs/2603.07192
Authors: Sungwoong Yune, Suheon Jeong, Joo-Young Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Autoregressive modeling, Spacetime Autoregressive modeling, highly efficient alternative, Visual Autoregressive, Autoregressive modeling
Comments:
Abstract:Visual Autoregressive modeling (VAR) has emerged as a highly efficient alternative to diffusion-based frameworks, achieving comparable synthesis quality. However, as this paradigm extends to Spacetime Autoregressive modeling (STAR) for video generation, scaling resolution and frame counts leads to a "token explosion" that creates a massive computational bottleneck in the final refinement stages. To address this, we propose FastSTAR, a training-free acceleration framework designed for high-quality video generation. Our core method, Spatiotemporal Token Pruning, identifies essential tokens by integrating two specialized terms: (1) Spatial similarity, which evaluates structural convergence across hierarchical scales to skip computations in regions where further refinement becomes redundant, and (2) Temporal similarity, which identifies active motion trajectories by assessing feature-level variations relative to the preceding clip. Combined with a Partial Update mechanism, FastSTAR ensures that only non-converged regions are refined, maintaining fluid motion while bypassing redundant computations. Experimental results on InfinityStar demonstrate that FastSTAR achieves up to a 2.01x speedup with a PSNR of 28.29 and less than 1% performance degradation, proving a superior efficiency-quality trade-off for STAR-based video synthesis.
218. 【2603.07181】FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation
Link: https://arxiv.org/abs/2603.07181
Authors: Jiaxu Zhou, Shaobo Wang, Zhiyuan Yang, Zhenjun Yu, Tao Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-Language Navigation aims, aims to enable, understand natural language, Vision-Language Navigation, UAV
Comments: 10 pages, 5 figures
Abstract:Vision-Language Navigation aims to enable agents to understand natural language instructions and carry out appropriate navigation actions in real-world environments. Most work focuses on indoor settings, with little research on complex outdoor scenes. Current UAV Vision-and-Language Navigation models typically act as black boxes without explicit reasoning. We introduce FreeFly-Thinking, an end-to-end VLN framework that converts the UAV agent's egocentric images and language instructions into a series of actions, inspired by the urban architecture environment proposed by OpenFly. We first construct a UAV dataset for the navigation task and then perform natural-language chain-of-thought reasoning. We adopt a two-stage training strategy: supervised fine-tuning and reinforcement fine-tuning. Experiments on unseen test data demonstrate strong performance, showing robustness and efficiency in UAV navigation.
219. 【2603.07170】Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology
Link: https://arxiv.org/abs/2603.07170
Authors: Marco Gustav, Fabian Wolf, Christina Glasner, Nic G. Reitsam, Stefan Schulz, Kira Aschenbroich, Bruno Märkl, Sebastian Foersch, Jakob Nikolas Kather
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rapid adoption, enabled prediction, prediction of molecular, molecular and clinical, clinical biomarkers
Comments:
Abstract:The rapid adoption of transformer-based models in computational pathology has enabled prediction of molecular and clinical biomarkers from H&E whole-slide images, yet interpretability has not kept pace with model complexity. While attribution- and generative-based methods are common, feature visualization approaches such as class visualizations (CVs) and activation atlases (AAs) have not been systematically evaluated for these models. We developed a visualization framework and assessed CVs and AAs for a transformer-based foundation model across tissue and multi-organ cancer classification tasks with increasing label granularity. Four pathologists annotated real and generated images to quantify inter-observer agreement, complemented by attribution and similarity metrics. CVs preserved recognizability for morphologically distinct tissues but showed reduced separability for overlapping cancer subclasses. In tissue classification, agreement decreased from Fleiss' κ = 0.75 (scans) to κ = 0.31 (CVs), with similar trends in cancer subclass tasks. AAs revealed layer-dependent organization: coarse tissue-level concepts formed coherent regions, whereas finer subclasses exhibited dispersion and overlap. Agreement was moderate for tissue classification (κ = 0.58), high for coarse cancer groupings (κ = 0.82), and low at subclass level (κ = 0.11). Atlas separability closely tracked expert agreement on real images, indicating that representational ambiguity reflects intrinsic pathological complexity. Attribution-based metrics approximated expert variability in low-complexity settings, whereas perceptual and distributional metrics showed limited alignment. Overall, concept-level feature visualization reveals structured morphological manifolds in transformer-based pathology models and provides a framework for expert-centered interrogation of learned representations across label granularities.
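The agreement figures in this abstract are Fleiss' kappa values, which measure agreement among a fixed number of raters beyond what chance would produce. The sketch below implements the standard formula in plain Python so the reported values (for example 0.75 versus 0.31) are easier to interpret. The input layout, a per-item table of category counts, is an assumption about how such annotations might be stored and is not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)

    return (p_bar - p_exp) / (1.0 - p_exp)
```

With four raters and two categories, `fleiss_kappa([[4, 0], [0, 4]])` gives perfect agreement (1.0), while a table in which every item is split two against two gives a negative value, meaning agreement below chance.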
220. 【2603.07166】ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels
Link: https://arxiv.org/abs/2603.07166
Authors: Reo Fukunaga, Soh Yoshida, Mitsuji Muneyasu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: memorizing incorrect labels, Deep neural networks, Deep neural, degrades their generalizability, prone to memorizing
Comments:
Abstract:Deep neural networks are prone to memorizing incorrect labels during training, which degrades their generalizability. Although recent methods have combined sample selection with semi-supervised learning (SSL) to exploit the memorization effect (where networks learn from clean data before noisy data), they cannot correct selection errors once a sample is misclassified. To overcome this, we propose asymmetric co-teaching with different architectures (ACD)-U, an asymmetric co-teaching framework that uses different model architectures and incorporates machine unlearning. ACD-U addresses this limitation through two core mechanisms. First, its asymmetric co-teaching pairs a contrastive language-image pretraining (CLIP)-pretrained vision Transformer with a convolutional neural network (CNN), leveraging their complementary learning behaviors: the pretrained model provides stable predictions, whereas the CNN adapts throughout training. This asymmetry, where the vision Transformer is trained only on clean samples and the CNN is trained through SSL, effectively mitigates confirmation bias. Second, selective unlearning enables post-hoc error correction by identifying incorrectly memorized samples through loss trajectory analysis and CLIP consistency checks, and then removing their influence via Kullback-Leibler divergence-based forgetting. This approach shifts the learning paradigm from passive error avoidance to active error correction. Experiments on synthetic and real-world noisy datasets, including CIFAR-10/100, CIFAR-N, WebVision, Clothing1M, and Red Mini-ImageNet, demonstrate state-of-the-art performance, particularly in high-noise regimes and under instance-dependent noise. The code is publicly available at this https URL.
221. 【2603.07163】PromptGate Client Adaptive Vision Language Gating for Open Set Federated Active Learning
Link: https://arxiv.org/abs/2603.07163
Authors: Adea Nesturi, David Dueñas Gaviria, Jiajun Zeng, Shadi Albarqouni
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: resource-constrained institutions demands, institutions demands data-efficient, demands data-efficient learning, data-efficient learning pipelines, Deploying medical
Comments: 3 Figures, 2 Tables, 10 pages
Abstract:Deploying medical AI across resource-constrained institutions demands data-efficient learning pipelines that respect patient privacy. Federated Learning (FL) enables collaborative medical AI without centralising data, yet real-world clinical pools are inherently open-set, containing out-of-distribution (OOD) noise such as imaging artifacts and wrong modalities. Standard Active Learning (AL) query strategies mistake this noise for informative samples, wasting scarce annotation budgets. We propose PromptGate, a dynamic VLM-gated framework for Open-Set Federated AL that purifies unlabeled pools before querying. PromptGate introduces a federated Class-Specific Context Optimization: lightweight, learnable prompt vectors that adapt a frozen BiomedCLIP backbone to local clinical domains and aggregate globally via FedAvg, without sharing patient data. As new annotations arrive, prompts progressively sharpen the ID/OOD boundary, turning the VLM into a dynamic gatekeeper that is strategy-agnostic: a plug-and-play pre-selection module enhancing any downstream AL strategy. Experiments on distributed dermatology and breast imaging benchmarks show that while static VLM prompting degrades to 50% ID purity, PromptGate maintains >95% purity with 98% OOD recall.
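The abstract says the learnable prompt vectors are aggregated globally via FedAvg. As a rough illustration of that aggregation step alone, the sketch below averages per-client vectors weighted by local sample counts, which is the usual FedAvg rule. Representing a prompt as a flat list of floats and weighting by sample count are assumptions, and `fedavg` is a hypothetical helper name.

```python
def fedavg(client_vectors, client_sizes):
    """Weighted average of client parameter vectors (the FedAvg rule).

    client_vectors[k] is the flat prompt vector from client k and
    client_sizes[k] is its number of local samples. Only the vectors
    leave the client, never the underlying patient data.
    """
    total = sum(client_sizes)
    dim = len(client_vectors[0])
    return [
        sum(vec[d] * n for vec, n in zip(client_vectors, client_sizes)) / total
        for d in range(dim)
    ]
```

For two clients holding `[1.0, 1.0]` (1 sample) and `[3.0, 3.0]` (3 samples), the aggregate is `[2.5, 2.5]`, pulled toward the client with more data.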
222. 【2603.07145】LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models
Link: https://arxiv.org/abs/2603.07145
Authors: Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent generative video, Recent generative, visual environment evolution, simulate visual environment, video world models
Comments:
Abstract:Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at this https URL.
223. 【2603.07144】CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose
Link: https://arxiv.org/abs/2603.07144
Authors: Li Jin, Yuchen Yang, Weikai Chen, Yujie Wang, Dehao Hao, Tanghui Jia, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Li Yuan, Long Quan, Xin Wang, Xueying Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: learning systems implicitly, coherent reference frame, systems implicitly assume, learning systems, reference frame
Comments:
Abstract:3D learning systems implicitly assume that objects occupy a coherent reference frame. Nonetheless, in practice, every asset arrives with an arbitrary global rotation, and models are left to resolve directional ambiguity on their own. This persistent misalignment suppresses pose-consistent generation, and blocks the emergence of stable directional semantics. To address this issue, we construct CanoVerse, a massive canonical 3D dataset of 320K objects over 1,156 categories, an order-of-magnitude increase over prior work. At this scale, directional semantics become statistically learnable: CanoVerse improves 3D generation stability, enables precise cross-modal 3D shape retrieval, and unlocks zero-shot point-cloud orientation estimation even for out-of-distribution data. This is achieved by a new canonicalization framework that reduces alignment from minutes to seconds per object via compact hypothesis generation and lightweight human discrimination, transforming canonicalization from manual curation into a high-throughput data generation pipeline. The CanoVerse dataset will be publicly released upon acceptance. Project page: this https URL
224. 【2603.07142】PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Link: https://arxiv.org/abs/2603.07142
Authors: Xijun Lu, Hongying Liu, Fanhua Shang, Yanming Hui, Liang Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: heterogeneous anomalies embedded, complex anatomical structures, faces unique challenges, unique challenges due, detection faces unique
Comments: Accepted by CVPR'2026
Abstract:Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose PDD (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a Manifold Matching and Unification (MMU) module, while an Inter-Level Feature Adaption (InA) module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via InA for local consistency, while the other receives skip-projected representations through a Manifold Prior Affine (MPA) module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that PDD significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8%, 5.1%, and 8.5% in AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4% in F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection. The implementation will be released at this https URL
225. 【2603.07135】The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating
Link: https://arxiv.org/abs/2603.07135
Authors: Landi He, Xiaoyu Yang, Lijian Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: carry redundant information, dominate inference cost, Visual tokens dominate, cost in vision-language, carry redundant
Comments:
Abstract:Visual tokens dominate inference cost in vision-language models (VLMs), yet many carry redundant information. Existing pruning methods alleviate this but typically rely on attention magnitude or similarity scores. We reformulate visual token pruning as capacity constrained communication: given a fixed budget K, the model must allocate limited bandwidth to maximally preserve visual information. We propose AutoSelect, which attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next token prediction loss, without auxiliary objectives or extra annotations. During training, a variance preserving noise gate modulates each token's information flow according to its predicted importance so that gradients propagate through all tokens; a diagonal attention Denoiser then recovers the perturbed representations. At inference, only the Scorer and a hard top-K selection remain, adding negligible latency. On ten VLM benchmarks, AutoSelect retains 96.5% of full model accuracy while accelerating LLM prefill by 2.85x with only 0.69 ms overhead, and transfers to different VLM backbones without architecture-specific tuning. Code is available at this https URL.
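According to the abstract, only the Scorer and a hard top-K selection remain at inference time. The sketch below shows what such a selection step could look like: given one importance score per visual token and a budget K, keep the K highest-scoring tokens in their original order. The function name and the choice to preserve token order are assumptions. The Scorer, the noise gate, and the Denoiser are not modeled here.

```python
def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order.

    Keeping positional order (rather than score order) is an assumption
    made so the kept tokens still form a coherent sequence.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_tokens(tokens, scores, k):
    """Keep only the tokens chosen by select_top_k."""
    return [tokens[i] for i in select_top_k(scores, k)]
```

With scores `[0.1, 0.9, 0.3, 0.7]` and `k=2`, tokens 1 and 3 are kept. The downstream language model then runs prefill over 2 tokens instead of 4.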
226. 【2603.07131】Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Link: https://arxiv.org/abs/2603.07131
Authors: Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision Language, show immense potential, Large Vision, automated ophthalmic diagnosis, Deep Expert Injection
Comments:
Abstract:Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
227. 【2603.07120】Inter-Image Pixel Shuffling for Multi-focus Image Fusion
Link: https://arxiv.org/abs/2603.07120
Authors: Huangxing Lin, Rongrong Ma, Cheng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: combine multiple partially, multiple partially focused, Multi-focus image fusion, Multi-focus image, aims to combine
Comments:
Abstract:Multi-focus image fusion aims to combine multiple partially focused images into a single all-in-focus image. Although deep learning has shown promise in this task, its effectiveness is often limited by the scarcity of suitable training data. This paper introduces Inter-image Pixel Shuffling (IPS), a novel method that allows neural networks to learn multi-focus image fusion without requiring actual multi-focus images. IPS reformulates the task as a pixel-wise classification problem, where the goal is to identify the focused pixel from a pixel group at each spatial position. In this method, pixels from a clear optical image are treated as focused, while pixels from a low-pass filtered version of the same image are considered defocused. By randomly shuffling the focused and defocused pixels at identical spatial positions in the original and filtered images, IPS generates training data that preserves spatial structure while mixing focus-defocus information. The model is trained to select the focused pixel from each spatially aligned pixel group, thus learning to reconstruct an all-in-focus image by aggregating sharp content from the input. To further enhance fusion quality, IPS adopts a cross-image fusion network that integrates the localized representation power of convolutional neural networks with the long-range modeling capabilities of state space models. This design effectively leverages both spatial detail and contextual information to produce high-quality fused results. Experimental results indicate that IPS significantly outperforms existing multi-focus image fusion methods, even without training on multi-focus images.
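The data-generation idea in this abstract can be sketched in a few lines: take a sharp image, low-pass filter it to get a "defocused" copy, then swap pixels between the two at random positions to form an input pair plus a per-pixel label saying which input holds the focused pixel. The code below is a toy version on nested lists. The 3x3 mean filter, the independent 50% swap per pixel, and all function names are assumptions. The paper's actual filter and shuffling scheme are not described in the abstract.

```python
import random


def box_blur(image):
    """3x3 mean filter with edge clamping; stands in for the low-pass filter."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = sum(vals) / 9.0
    return out


def shuffle_pair(sharp, blurred, rng):
    """Swap pixels between the sharp and blurred images at random positions.

    Returns (input_a, input_b, labels), where labels[y][x] is 0 if input_a
    holds the focused pixel at (y, x) and 1 if input_b does.
    """
    h, w = len(sharp), len(sharp[0])
    input_a = [[0.0] * w for _ in range(h)]
    input_b = [[0.0] * w for _ in range(h)]
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if rng.random() < 0.5:
                input_a[y][x], input_b[y][x] = sharp[y][x], blurred[y][x]
            else:
                input_a[y][x], input_b[y][x] = blurred[y][x], sharp[y][x]
                labels[y][x] = 1
    return input_a, input_b, labels


def fuse(input_a, input_b, labels):
    """Pick the pixel indicated by the label at each position."""
    return [
        [b if lab else a for a, b, lab in zip(row_a, row_b, row_l)]
        for row_a, row_b, row_l in zip(input_a, input_b, labels)
    ]
```

Fusing with the ground-truth labels reproduces the sharp image exactly, which is why per-pixel classification is a sufficient training target: a model that predicts the labels well has in effect learned to assemble the all-in-focus image.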
228. 【2603.07119】TIQA: Human-Aligned Text Quality Assessment in Generated Images
Link: https://arxiv.org/abs/2603.07119
Authors: Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: persistent failure mode, existing evaluations rely, VLM-based judging procedures, perceptual text artifacts, Text rendering remains
Comments:
Abstract:Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least ~0.05 on TIQA-Crops and ~0.08 on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by +14% on average, demonstrating practical value for filtering and reranking in generation pipelines.
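Two quantities in this abstract are easy to pin down in code: PLCC (the Pearson linear correlation coefficient between predicted scores and human mean opinion scores) and best-of-N selection using a quality scorer. The sketch below implements both in plain Python. The `score_fn` argument is a placeholder for any quality model and does not represent the ANTIQA interface, which the abstract does not describe.

```python
import math


def plcc(predicted, human):
    """Pearson linear correlation between predicted and human scores."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    return cov / math.sqrt(var_p * var_h)


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn.

    score_fn is a hypothetical stand-in for a text-quality model.
    """
    return max(candidates, key=score_fn)
```

A PLCC gain of about 0.05 to 0.08 therefore means the predicted scores track human ratings more linearly. Best-of-5 reranking keeps whichever of five generations the scorer rates highest.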
229. 【2603.07113】Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning
Link: https://arxiv.org/abs/2603.07113
Authors: Wangyu Feng, Shawn Young, Lijian Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Chest X-ray, paradigm for Chest, Self-supervised learning, powerful paradigm, existing SSL
Comments:
Abstract:Self-supervised learning (SSL) has emerged as a powerful paradigm for Chest X-ray (CXR) analysis under limited annotations. Yet, existing SSL strategies remain suboptimal for medical imaging. Masked image modeling allocates substantial computation to reconstructing high-frequency background details with limited diagnostic value. Contrastive learning, on the other hand, often depends on aggressive augmentations that risk altering clinically meaningful structures. We introduce Semantic-Partitioned Contrastive Learning (S-PCL), an efficient pre-training framework tailored for CXR representation learning. Instead of reconstructing pixels or relying on heavy augmentations, S-PCL randomly partitions patch tokens from a single CXR into two non-overlapping semantic subsets. Each subset provides a complementary but incomplete view. The encoder must maximize agreement between these partitions, implicitly inferring global anatomical layout and local pathological cues from partial evidence. This semantic partitioning forms an internal bottleneck that enforces long-range dependency modeling and structural coherence. S-PCL eliminates the need for hand-crafted augmentations, auxiliary decoders, and momentum encoders. The resulting architecture is streamlined, computationally efficient, and easy to scale. Extensive experiments on large-scale CXR benchmarks, including ChestX-ray14, CheXpert, RSNA Pneumonia and SIIM-ACR Pneumothorax, show that S-PCL achieves competitive performance while attaining the lowest GFLOPs and superior accuracy among existing SSL approaches.
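The partitioning step described in this abstract, which splits the patch tokens of a single image into two non-overlapping subsets, can be illustrated as follows. This minimal sketch assumes an equal split and a simple random shuffle of token indices. The abstract does not give the actual split ratio, the sampling scheme, or the contrastive loss applied afterwards, and `partition_tokens` is a hypothetical name.

```python
import random


def partition_tokens(num_tokens, rng):
    """Randomly split token indices 0..num_tokens-1 into two disjoint views.

    The equal split is an assumption; each view sees only part of the
    image, so agreeing with the other view requires inferring the rest.
    """
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    half = num_tokens // 2
    return sorted(indices[:half]), sorted(indices[half:])
```

For a 14x14 patch grid (196 tokens), each view would contain 98 tokens and the two views together cover the whole image exactly once.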
230. 【2603.07098】NuNext: Reframing Nucleus Detection as Next-Point Detection
Link: https://arxiv.org/abs/2603.07098
Authors: Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, Cheng Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: clinical applications, histopathology is pivotal, wide range, range of clinical, Nucleus detection
Comments:
Abstract:Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design distribution matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.
231. 【2603.07093】Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction
Link: https://arxiv.org/abs/2603.07093
Authors: Xu Chen, Rui Gao, Xinjie Zhang, Haoyu Zhang, Che Sun, Zhi Gao, Yuwei Wu, Yunde Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Achieving natural dyadic, requires generating facial, interaction requires generating, Achieving natural, natural dyadic interaction
Comments:
Abstract:Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.
232. 【2603.07090】mAVE: A Watermark for Joint Audio-Visual Generation Models
Link: https://arxiv.org/abs/2603.07090
Authors: Luyang Si, Leyi Pan, Lijie Wen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: widespread commercial deployment, ensuring content provenance, Joint Audio-Visual Generation, Audio-Visual Generation Models, commercial deployment
Comments:
Abstract:As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($Video_{wm} \vee Audio_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity (99%), mAVE offers a robust cryptographic defense for vendor copyright.
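The Swap Attack described in this abstract exploits the OR rule: a detector that accepts content when either modality is watermarked will still accept a clip whose audio has been replaced. The toy sketch below contrasts that rule with a joint check that requires both modalities to carry the same tag. It illustrates only the verification logic, not mAVE itself: representing a watermark as a comparable tag value (or `None` when absent) is an assumption, and the real method binds latents rather than comparing tags.

```python
def independent_check(video_tag, audio_tag):
    """OR rule: accept if either modality carries a watermark."""
    return video_tag is not None or audio_tag is not None


def joint_check(video_tag, audio_tag):
    """Binding rule: accept only if both modalities carry the same tag.

    Tag equality is a toy stand-in for a cryptographic binding test.
    """
    return video_tag is not None and video_tag == audio_tag
```

For a swapped clip, where the video carries `"vendor-123"` but the replacement audio has no tag, `independent_check` returns `True` and the fake is attributed to the vendor. `joint_check` returns `False`.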
233. 【2603.07077】Aligning What EEG Can See: Structural Representations for Brain-Vision Matching
Link: https://arxiv.org/abs/2603.07077
Authors: Jingyi Tang, Shuai Jiang, Fei Su, Zhicheng Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: non-invasive brain-computer interfaces, highly promising avenue, brain-computer interfaces, promising avenue, avenue for non-invasive
Comments:
Abstract:Visual decoding from electroencephalography (EEG) has emerged as a highly promising avenue for non-invasive brain-computer interfaces (BCIs). Existing EEG-based decoding methods predominantly align brain signals with the final-layer semantic embeddings of deep visual models. However, relying on these highly abstracted embeddings inevitably leads to severe cross-modal information mismatch. In this work, we introduce the concept of Neural Visibility and accordingly propose the EEG-Visible Layer Selection Strategy, aligning EEG signals with intermediate visual layers to minimize this mismatch. Furthermore, to accommodate the multi-stage nature of human visual processing, we propose a novel Hierarchically Complementary Fusion (HCF) framework that jointly integrates visual representations from different hierarchical levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance, reaching an 84.6% accuracy (+21.4%) on zero-shot visual decoding on the THINGS-EEG dataset. Moreover, our method achieves up to a 129.8% performance gain across diverse EEG baselines, demonstrating its robust generalizability.
234. 【2603.07076】Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network
Link: https://arxiv.org/abs/2603.07076
Authors: Shixuan Xu, Yabo Liu, Junyu Dong, Xinghui Dong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Underwater Image Enhancement, severe degradation caused, Existing Underwater Image, Underwater Image, Image Enhancement Network
Comments:
Abstract:Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast and reduced visibility. Existing Underwater Image Enhancement (UIE) methods can be divided into two categories, i.e., prior-based and learning-based methods. The former rely on rigid physical assumptions that limit the adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples the Retinex-grounded illumination correction with the language-informed guidance. This network comprises a Prior-Free Illumination Estimator, a Cross-Modal Text Aligner and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, namely, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and the multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.
235. 【2603.07074】Physics-Guided VLM Priors for All-Cloud Removal
Link: https://arxiv.org/abs/2603.07074
Authors: Liying Xu, Huifang Li, Huanfeng Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: optical remote sensing, remote sensing due, fundamental challenge, challenge in optical, optical remote
Comments:
Abstract:Cloud removal is a fundamental challenge in optical remote sensing due to heterogeneous degradation. Thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, requiring explicit cloud-type decisions and often leading to error accumulation and discontinuities in mixed-cloud scenes. Therefore, we propose a novel approach named Physical-VLM All-Cloud Removal (PhyVLM-CR) that integrates the semantic capability of a Vision-Language Model (VLM) into a physical restoration model, achieving high-fidelity unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Leveraging this confidence map as a continuous soft gate, our method achieves a unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, while seamlessly transitioning to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation, ensuring a coherent removal across heterogeneous cloud covers. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach achieves a remarkable balance between cloud removal and content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.
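To make the "continuous soft gate" idea concrete, here is a minimal sketch of a per-pixel convex blend between two restoration candidates. It is an illustration only, not the authors' implementation: the function name `soft_gate_blend`, the flat-list pixel layout, and the convention that a confidence of 1.0 fully trusts the physical inversion are all assumptions made for this example.

```python
from typing import List


def soft_gate_blend(physical: List[float],
                    temporal: List[float],
                    confidence: List[float]) -> List[float]:
    """Blend two candidate restorations pixel by pixel.

    Assumed convention (hypothetical): confidence 1.0 keeps the physical
    inversion, confidence 0.0 keeps the temporal-reference reconstruction.
    """
    if not (len(physical) == len(temporal) == len(confidence)):
        raise ValueError("all inputs must have the same length")
    blended = []
    for p, t, c in zip(physical, temporal, confidence):
        c = min(1.0, max(0.0, c))  # clamp the gate to [0, 1]
        blended.append(c * p + (1.0 - c) * t)
    return blended
```

Because the gate is continuous, no hard boundary between thin-cloud and thick-cloud regions has to be drawn; intermediate confidence values simply mix the two sources.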
236. 【2603.07071】VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
Link: https://arxiv.org/abs/2603.07071
Authors: Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent Vision-Language Models, made remarkable progress, Recent Vision-Language, understanding remains unreliable, remains unreliable
Comments: Accepted to CVPR 2026
Abstract:Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.
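The evaluation problem described above can be sketched as a scoring rule that treats answerable and unanswerable cases separately. This is a hedged illustration, not VirtueBench's actual protocol: the representation of a refusal as `None`, the `(ground_truth, prediction)` tuple layout, and the metric names are assumptions for this example.

```python
from typing import Dict, List, Optional, Tuple

# (ground_truth, prediction); ground_truth None marks an unanswerable case,
# prediction None marks a refusal. Both conventions are hypothetical.
Case = Tuple[Optional[str], Optional[str]]


def score(cases: List[Case]) -> Dict[str, float]:
    """Report answer accuracy and refusal accuracy as separate numbers."""
    answerable = [(gt, pred) for gt, pred in cases if gt is not None]
    unanswerable = [(gt, pred) for gt, pred in cases if gt is None]
    answer_acc = (
        sum(pred == gt for gt, pred in answerable) / len(answerable)
        if answerable else 0.0
    )
    refusal_acc = (
        sum(pred is None for _, pred in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    return {"answer_accuracy": answer_acc, "refusal_accuracy": refusal_acc}
```

Under this rule a lucky guess on an unanswerable case no longer earns credit, which is the behavior the abstract argues a trustworthy benchmark should have.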
237. 【2603.07066】MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering
Link: https://arxiv.org/abs/2603.07066
Authors: Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: imaging data augmentation, Generative diffusion models, medical imaging data, causal training data, produce causal training
Comments:
Abstract:Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is at link this https URL
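Activation steering in general can be sketched with plain vectors. The example below uses the common difference-of-means recipe to build a concept direction and then shifts an activation along it; whether MedSteer computes its pathology vector this way is not stated in the abstract, so treat the recipe and every name here (`concept_vector`, `steer`, `alpha`) as assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(with_concept: List[Vector],
                   without_concept: List[Vector]) -> Vector:
    """Difference of mean activations between two contrastive prompt sets."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift an activation along the concept direction.

    Positive alpha pushes toward the concept, negative alpha away from it.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]
```

Since only the activation is shifted and the sampling trajectory is otherwise untouched, the steered and unsteered generations share everything except the targeted concept, which is the property the abstract relies on for counterfactual pairs.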
238. 【2603.07057】SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Link: https://arxiv.org/abs/2603.07057
Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion Transformers, key bottleneck hindering, low inference efficiency, inference efficiency remains, hindering further advancement
Comments: 23 pages, CVPR 2026 accepted
Abstract:Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$\alpha$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: this https URL.
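Choosing cache intervals by dynamic programming with a sensitivity-based cost can be sketched as a classic partition problem. The code below is a simplified illustration under stated assumptions, not SODA's algorithm: it assumes one scalar sensitivity per timestep, a fixed budget of full-compute steps, and a hypothetical cost model in which the error of a reused step grows linearly with its distance from the last full computation.

```python
from typing import List, Sequence, Tuple


def plan_cache_steps(sens: Sequence[float],
                     n_compute: int) -> Tuple[float, List[int]]:
    """Pick n_compute timesteps to compute fully; all others reuse the cache.

    Returns (total_cost, sorted list of full-compute timesteps).
    Step 0 is always computed because the first segment starts there.
    """
    total = len(sens)
    if not 1 <= n_compute <= total:
        raise ValueError("n_compute must be between 1 and len(sens)")
    inf = float("inf")

    def seg_cost(i: int, j: int) -> float:
        # Hypothetical cost: compute at i, reuse at i+1 .. j-1.
        return sum(sens[t] * (t - i) for t in range(i + 1, j))

    # best[k][j]: minimum cost of covering steps [0, j) with k segments.
    best = [[inf] * (total + 1) for _ in range(n_compute + 1)]
    prev = [[-1] * (total + 1) for _ in range(n_compute + 1)]
    best[0][0] = 0.0
    for k in range(1, n_compute + 1):
        for j in range(1, total + 1):
            for i in range(j):
                if best[k - 1][i] == inf:
                    continue
                cand = best[k - 1][i] + seg_cost(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    prev[k][j] = i

    # Backtrack the segment starts, which are the full-compute steps.
    steps: List[int] = []
    j = total
    for k in range(n_compute, 0, -1):
        i = prev[k][j]
        steps.append(i)
        j = i
    return best[n_compute][total], steps[::-1]
```

With sensitivities `[1.0, 1.0, 10.0, 1.0]` and a budget of two full computations, the plan refreshes the cache exactly at the highly sensitive step rather than at a fixed interval, which is the intuition behind sensitivity-oriented scheduling.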
239. 【2603.07048】Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation
Link: https://arxiv.org/abs/2603.07048
Authors: Xiaochen Yang, Hao Fang, Jiawei Kong, Yaoxin Mao, Bin Chen, Shu-Tao Xia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: demonstrated remarkable capabilities, large vision-language models, remarkable capabilities, large vision-language, demonstrated remarkable
Comments:
Abstract:Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.
240. 【2603.07043】Fine-Grained 3D Facial Reconstruction for Micro-Expressions
Link: https://arxiv.org/abs/2603.07043
Authors: Che Sun, Xinjie Zhang, Rui Gao, Xu Chen, Yuwei Wu, Yunde Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable performance, Recent advances, micro-expressions remains unexplored, remains unexplored, demonstrated remarkable
Comments:
Abstract:Recent advances in 3D facial expression reconstruction have demonstrated remarkable performance in capturing macro-expressions, yet the reconstruction of micro-expressions remains unexplored. This novel task is particularly challenging due to the subtle, transient, and low-intensity nature of micro-expressions, which complicate the extraction of stable and discriminative features essential for accurate reconstruction. In this paper, we propose a fine-grained micro-expression reconstruction method that integrates a global dynamic feature capturing stable facial motion patterns with a locally-enriched feature incorporating multiple informative cues from 2D motions, facial priors and 3D facial geometry. Specifically, we devise a plug-and-play dynamic-encoded module to extract micro-expression feature for global facial action, allowing it to leverage prior knowledge from abundant macro-expression data to mitigate the scarcity of micro-expression data. Subsequently, a dynamic-guided mesh deformation module is designed for extracting aggregated local features from dense optical flow, sparse landmark cues and facial mesh geometry, which adaptively refines fine-grained facial micro-expression without compromising global 3D geometry. Extensive experiments on micro-expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in both geometric accuracy and perceptual detail.
241. 【2603.07028】Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking
Link: https://arxiv.org/abs/2603.07028
Authors: Moyang Chen, Zonghao Ying, Wenzhuo Xu, Quancheng Zou, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: raising urgent concerns, synthesize complex videos, lightweight natural language, natural language prompts, raising urgent
Comments:
Abstract:Recent text-to-video (T2V) models can synthesize complex videos from lightweight natural language prompts, raising urgent concerns about safety alignment in the event of misuse in the real world. Prior jailbreak attacks typically rewrite unsafe prompts into paraphrases that evade content filters while preserving meaning. Yet, these approaches often still retain explicit sensitive cues in the input text and therefore overlook a more profound, video-specific weakness. In this paper, we identify a temporal trajectory infilling vulnerability of T2V systems under fragmented prompts: when the prompt specifies only sparse boundary conditions (e.g., start and end frames) and leaves the intermediate evolution underspecified, the model may autonomously reconstruct a plausible trajectory that includes harmful intermediate frames, despite the prompt appearing benign to input or output side filtering. Building on this observation, we propose TFM. This fragmented prompting framework converts an originally unsafe request into a temporally sparse two-frame extraction and further reduces overtly sensitive cues via implicit substitution. Extensive evaluations across multiple open-source and commercial T2V models demonstrate that TFM consistently enhances jailbreak effectiveness, achieving up to a 12% increase in attack success rate on commercial systems. Our findings highlight the need for temporally aware safety mechanisms that account for model-driven completion beyond prompt surface form.
242. 【2603.07022】OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation
Link: https://arxiv.org/abs/2603.07022
Authors: Leilei Wang, Longfei Liu, Xi Shen, Xuanlong Yu, Ying Tiffany He, Fei Richard Yu, Yingyi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: strict latency constraints, Current real-time OVOD, real-time OVOD methods, dynamic environments, essential for practical
Comments:
Abstract:Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at this https URL.
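Composing several training samples into one structured grid mainly requires remapping each sample's boxes into the coordinates of its grid cell. The sketch below shows only that bookkeeping step and is not the GridSynthetic implementation: the normalized `(x_min, y_min, x_max, y_max)` box format, the row-major cell order, and the function name `compose_grid_boxes` are assumptions for illustration.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max), normalized to [0, 1] within one image.
Box = Tuple[float, float, float, float]


def compose_grid_boxes(samples: List[List[Box]],
                       rows: int, cols: int) -> List[Box]:
    """Remap per-image boxes into a rows x cols composite image.

    samples[k] holds the boxes of the image placed in cell k (row-major).
    """
    if len(samples) != rows * cols:
        raise ValueError("need exactly rows * cols samples")
    remapped: List[Box] = []
    for idx, boxes in enumerate(samples):
        r, c = divmod(idx, cols)  # grid cell of this sample
        for x0, y0, x1, y1 in boxes:
            remapped.append((
                (c + x0) / cols, (r + y0) / rows,
                (c + x1) / cols, (r + y1) / rows,
            ))
    return remapped
```

The pixel data would be resized and tiled with the same row-major cell order, so each remapped box still encloses its object in the composite image.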
243. 【2603.06999】TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models
Link: https://arxiv.org/abs/2603.06999
Authors: Jiajun Cheng, Xiaofan Yu, Subarna, Sainan Liu, Shan Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recognizing instruments' interactions, Recognizing instruments', robotic surgery, essential for building, building context-aware
Comments:
Abstract:Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieved better generalization on a wide range of tasks compared to conventional task-specific deep learning approaches. However, their performance on instrument--tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module to generate visual semantic embeddings that better capture fine-grained action details. We further incorporate prompt tuning and a verb-rephrasing technique to enable smooth adaptation to the instrument--tissue interaction recognition task. Extensive experiments on the public laparoscopic benchmark, CholecT50, show that our method improves both Average Precision and Top-K accuracy. We also investigate whether visual embeddings of instrument--tissue interaction regions align better with the corresponding text by visualizing the cosine similarity between visual and textual embeddings. The visualization results indicate that the proposed method improves alignment between relevant visual and textual representations.
244. 【2603.06993】AdaGen: Learning Adaptive Policy for Image Synthesis
Link: https://arxiv.org/abs/2603.06993
Authors: Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Masked Generative Transformers, rectified flow models, Recent advances, Generative Transformers, powerful generative models
Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Journal version of [arXiv:2409.00342](https://arxiv.org/abs/2409.00342) (ECCV 2024). Code is available at: [this https URL](https://github.com/LeapLabTHU/AdaGen)
Abstract:Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of synthesis into multiple steps. However, this introduces a proliferation of step-specific parameters (e.g., noise level or temperature at each step). Existing approaches typically rely on manually-designed rules to manage this complexity, demanding expert knowledge and trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network determines suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments on four generative paradigms validate the superiority of AdaGen. For example, AdaGen achieves better performance on DiT-XL with 3 times lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible computational overhead.
245. 【2603.06989】MipSLAM: Alias-Free Gaussian Splatting SLAM
Link: https://arxiv.org/abs/2603.06989
Authors: Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: SLAM framework capable, varying camera configurations, Gaussian Splatting, paper introduces MipSLAM, SLAM framework
Comments: Accepted to ICRA 2026
Abstract:This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. A novel local frequency-domain perceptual loss is also introduced to enhance fine-grained geometric detail recovery. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions while maintaining real-time capability. Code is available at this https URL.
246. 【2603.06986】ADAS-TO: A Large-Scale Multimodal Naturalistic Dataset and Empirical Characterization of Human Takeovers during ADAS Engagement
Link: https://arxiv.org/abs/2603.06986
Authors: Yuhang Wang, Yiyao Xu, Jingran Sun, Hao Zhou
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: key safety vulnerability, existing public resources, public resources rarely, resources rarely provide, rarely provide takeover-centered
Comments:
Abstract:Takeovers remain a key safety vulnerability in production ADAS, yet existing public resources rarely provide takeover-centered, real-world data. We present ADAS-TO, the first large-scale naturalistic dataset dedicated to ADAS-to-manual transitions, containing 15,659 takeover-centered 20s clips from 327 drivers across 22 vehicle brands. Each clip synchronizes front-view video with CAN logs. Takeovers are defined as ADAS ON $\rightarrow$ OFF transitions, with the primary trigger labeled as brake, steer, gas, mixed, or system disengagement. We further separate planned driver-initiated terminations (Ego) from forced takeovers (Non-ego) using a rule-based partition. While most events occur within conservative kinematic margins, we identify a long tail of 285 safety-critical cases. For these events, we combine kinematic screening with vision--language (VLM) annotation to attribute hazards and relate them to intervention dynamics. The resulting cross-modal analysis shows distinct kinematic signatures across traffic dynamics, infrastructure degradation, and adverse environments, and finds that in 59.3% of critical cases, actionable visual cues emerge at least 3s before takeover, supporting the potential for semantics-aware early warning beyond late-stage kinematic triggers. The dataset is publicly released at this http URL.
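The takeover definition above (an ADAS ON to OFF transition) and the takeover-centered clip can be sketched in a few lines. This is an illustration under assumptions, not the dataset's extraction code: the boolean `adas_on` signal, the uniform sampling rate `hz`, and the clamping of clips at the log boundaries are hypothetical choices.

```python
from typing import List, Tuple


def find_takeovers(adas_on: List[bool]) -> List[int]:
    """Indices i where ADAS was ON at sample i-1 and OFF at sample i."""
    return [
        i for i in range(1, len(adas_on))
        if adas_on[i - 1] and not adas_on[i]
    ]


def clip_window(index: int, hz: int, n_samples: int,
                clip_s: float = 20.0) -> Tuple[int, int]:
    """Sample range [start, end) of a clip centered on a takeover.

    The window is clamped to the log, so clips near the start or end of a
    recording come out shorter than clip_s seconds.
    """
    half = int(clip_s * hz / 2)
    start = max(0, index - half)
    end = min(n_samples, index + half)
    return start, end
```

For example, `find_takeovers([False, True, True, False, False, True, False])` flags samples 3 and 6; the trigger label (brake, steer, gas, mixed, or system disengagement) would then be read from the CAN signals around each flagged sample.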
247. 【2603.06985】Perception-Aware Multimodal Spatial Reasoning from Monocular Images
Link: https://arxiv.org/abs/2603.06985
Authors: Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang Jr, ShiJie Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: current Vision-Language Models, Vision-Language Models, ambiguous object appearance, large scale variation, fine-grained geometric perception
Comments:
Abstract:Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches - including those using RL-based post-training - by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
248. 【2603.06982】Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning
Link: https://arxiv.org/abs/2603.06982
Authors: Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Image-based shape retrieval, Image-based shape, computer vision, computer graphics, aims to retrieve
Comments:
Abstract:Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image--point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training on our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.
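Retrieval by similarity search over single-embedding shape descriptors, together with the Top-k accuracy metric, can be sketched as follows. The embeddings here are toy 2D vectors and the helper names (`cosine`, `retrieve`, `top_k_accuracy`) are hypothetical; in the setting described above the vectors would come from pre-aligned image and point-cloud encoders.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             shapes: Dict[str, Sequence[float]], k: int) -> List[str]:
    """IDs of the k shapes whose embeddings are most similar to the query."""
    ranked = sorted(shapes, key=lambda sid: cosine(query, shapes[sid]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(queries: List[Tuple[Sequence[float], str]],
                   shapes: Dict[str, Sequence[float]], k: int) -> float:
    """Fraction of queries whose ground-truth shape is among the top k."""
    hits = sum(1 for q, gt in queries if gt in retrieve(q, shapes, k))
    return hits / len(queries)
```

Because every shape is a single vector, no multi-view rendering is needed at query time; a larger database would typically swap the exhaustive sort for an approximate nearest-neighbor index.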
249. 【2603.06973】T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
Link: https://arxiv.org/abs/2603.06973
Authors: Chaohong Guo, Yihan He, Yongwei Nie, Fei Ma, Xuemiao Xu, Chengjiang Long
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural language query, Video Temporal Grounding, Temporal Grounding, complex temporal dynamics, Temporal
Comments:
Abstract:Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal to Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. We employ an overlapping sliding-window mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in a row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.
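The two mechanical steps named above, overlapping sliding windows and row-major grid placement, can be sketched as index arithmetic. This is a minimal illustration, not the T2SGrid code: the parameter names `window`, `stride`, and `cols` are assumptions, and trailing frames that do not fill another stride are simply dropped in this sketch.

```python
from typing import Dict, List, Tuple


def sliding_windows(n_frames: int, window: int,
                    stride: int) -> List[Tuple[int, int]]:
    """Frame ranges [start, end) of the clips; stride < window gives overlap."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n_frames <= 0:
        return []
    last_start = max(n_frames - window, 0)
    return [
        (s, min(s + window, n_frames))
        for s in range(0, last_start + 1, stride)
    ]


def grid_positions(start: int, end: int,
                   cols: int) -> Dict[int, Tuple[int, int]]:
    """Map each frame index in [start, end) to its (row, col) grid cell.

    Frames fill the grid chronologically in row-major order.
    """
    return {f: divmod(f - start, cols) for f in range(start, end)}
```

For instance, a 10-frame video with `window=4` and `stride=2` yields the clips (0, 4), (2, 6), (4, 8), and (6, 10), and each clip becomes a 2 x 2 composite image when `cols=2`.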
250. 【2603.06972】Conditional Unbalanced Optimal Transport Maps: An Outlier-Robust Framework for Conditional Generative Modeling
Link: https://arxiv.org/abs/2603.06972
Authors: Jiwoo Yoon, Kyumin Choi, Jaewoong Choi
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unbalanced Optimal Transport, Optimal Transport, Conditional Unbalanced Optimal, Optimal Transport Maps, Conditional Optimal Transport
Comments: 15 pages, 5 figures
Abstract:The Conditional Optimal Transport (COT) problem aims to find a transport map between conditional source and target distributions while minimizing the transport cost. Recently, these transport maps have been utilized in conditional generative modeling tasks to establish efficient mappings between the distributions. However, classical COT inherits a fundamental limitation of optimal transport, i.e., sensitivity to outliers, which arises from the hard distribution matching constraints. This limitation becomes more pronounced in a conditional setting, where each conditional distribution is estimated from a limited subset of data. To address this, we introduce the Conditional Unbalanced Optimal Transport (CUOT) framework, which relaxes conditional distribution-matching constraints through Csiszár divergence penalties while strictly preserving the conditioning marginals. We establish a rigorous formulation of the CUOT problem and derive its dual and semi-dual formulations. Based on the semi-dual form, we propose Conditional Unbalanced Optimal Transport Maps (CUOTM), an outlier-robust conditional generative model built upon a triangular $c$-transform parameterization. We theoretically justify the validity of this parameterization by proving that the optimal triangular map satisfies the $c$-transform relationships. Our experiments on 2D synthetic and image-scale datasets demonstrate that CUOTM achieves superior outlier robustness and competitive distribution-matching performance compared to existing COT-based baselines, while maintaining high sampling efficiency.
251. 【2603.06971】SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation
Link: https://arxiv.org/abs/2603.06971
Authors: Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advancing robotic-assisted surgery, Reconstructing surgical scenes, monocular endoscopic video, robotic-assisted surgery, Reconstructing surgical
Comments:
Click to view abstract
Abstract:Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: this https URL.
252. 【2603.06956】Virtual Intraoperative CT (viCT): Sequential Anatomic Updates for Modeling Tissue Resection Throughout Endoscopic Sinus Surgery
Link: https://arxiv.org/abs/2603.06956
Authors: Nicole M. Gunderson, Graham J. Harris, Jeremy S. Ruthberg, Pengcheng Chen, Di Mao, Randall A. Bly, Waleed M. Abuzeid, Eric J. Seibel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Incomplete dissection, revision endoscopic sinus, endoscopic sinus surgery, monocular endoscopic video, chronic rhinosinusitis
Comments:
Click to view abstract
Abstract:Purpose: Incomplete dissection is a common cause of persistent disease and revision endoscopic sinus surgery (ESS) in chronic rhinosinusitis. Current image-guided surgery systems typically reference static preoperative CT (pCT), and do not model evolving resection boundaries. We present Virtual Intraoperative CT (viCT), a method for sequentially updating pCT throughout ESS using intraoperative 3D reconstructions from monocular endoscopic video to enable visualization of evolving anatomy in CT format. Methods: Monocular endoscopic video is processed using a depth-supervised NeRF framework with virtual stereo synthesis to generate metrically scaled 3D reconstructions at multiple surgical intervals. Reconstructions undergo rigid, landmark-based registration in 3D Slicer guided by anatomical correspondences, and are then voxelized into the pCT grid. viCT volumes were generated using a ray-based occupancy comparison between pCT and reconstruction to delete outdated voxels and remap preserved anatomy and updated boundaries. Performance is evaluated in a cadaveric feasibility study of four specimens across four ESS stages using volumetric overlap (DSC, Jaccard) and surface metrics (HD95, Chamfer, MSD, RMSD), and qualitative comparisons to ground-truth CT. Results: viCT updates show agreement with ground-truth anatomy across surgical stages, with submillimeter mean surface errors. Dice Similarity Coefficient (DSC) = 0.88 +/- 0.05 and Jaccard Index = 0.79 +/- 0.07, and Hausdorff Distance 95% (HD95) = 0.69 +/- 0.28 mm, Chamfer Distance = 0.09 +/- 0.05 mm, Mean Surface Distance (MSD) = 0.11 +/- 0.05 mm, and Root Mean Square Distance (RMSD) = 0.32 +/- 0.10 mm. Conclusion: viCT enables CT-format anatomic updating in an ESS setting without ancillary hardware. Future work will focus on fully automating registration, validation in live cases, and optimizing runtime for real-time deployment.
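The volumetric overlap metrics named in this abstract (DSC and Jaccard) have standard definitions, and a minimal sketch may help when reading the reported numbers. The function name, the set-of-voxel-indices representation, and the toy voxel grids below are illustrative assumptions, not the authors' evaluation code.

```python
def dice_and_jaccard(pred, truth):
    """Return (DSC, Jaccard) for two collections of occupied voxel indices."""
    pred, truth = set(pred), set(truth)
    inter = len(pred & truth)
    union = len(pred | truth)
    if union == 0:
        # Both volumes empty: treat as perfect agreement (a convention, assumed here).
        return 1.0, 1.0
    dsc = 2.0 * inter / (len(pred) + len(truth))
    jac = inter / union
    return dsc, jac


# Hypothetical toy volumes: a 4x4 slab and the same slab shifted by one voxel in x.
truth = {(x, y, 0) for x in range(4) for y in range(4)}
pred = {(x, y, 0) for x in range(1, 5) for y in range(4)}

dsc, jac = dice_and_jaccard(pred, truth)
print(f"DSC = {dsc:.2f}, Jaccard = {jac:.2f}")
```

For a single pair of volumes the two metrics are tied by J = D / (2 - D). The reported means (DSC 0.88, Jaccard 0.79) are roughly in line with that relation, although it need not hold exactly for averages over specimens.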
Submission history: [v1] Sat, 7 Mar 2026 00:23:05 UTC (18,884 KB), from Nicole Gunderson
253. 【2603.06936】Extracting and analyzing 3D histomorphometric features related to perineural and lymphovascular invasion in prostate cancer
Link: https://arxiv.org/abs/2603.06936
Authors: Sarah S.L. Chow, Rui Wang, Robert B. Serafin, Yujie Zhao, Elena Baraznenok, Xavier Farré, Jennifer Salguero-Lopez, Gan Gao, Huai-Ching Hsieh, Lawrence D. True, Priti Lal, Anant Madabhushi, Jonathan T.C. Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diagnostic grading, histology sections, grading of prostate, prostate cancer, Diagnostic
Comments:
Click to view abstract
Abstract:Diagnostic grading of prostate cancer (PCa) relies on the examination of 2D histology sections. However, the limited sampling of specimens afforded by 2D histopathology, and ambiguities when viewing 2D cross-sections, can lead to suboptimal treatment decisions. Recent studies have shown that 3D histomorphometric analysis of glands and nuclei can improve PCa risk assessment compared to analogous 2D features. Here, we expand on these efforts by developing an analytical pipeline to extract 3D features related to perineural invasion (PNI) and lymphovascular invasion (LVI), which correlate with poor prognosis for a variety of cancers. A 3D segmentation model (nnU-Net) was trained to segment nerves and vessels in 3D datasets of archived prostatectomy specimens that were optically cleared, labeled with a fluorescent analog of H&E, and imaged with open-top light-sheet (OTLS) microscopy. PNI- and LVI-related features, including metrics describing cancer-nerve and cancer-vessel proximity, were then extracted based on the 3D nerve/vessel segmentation masks in conjunction with 3D masks of cancer-enriched regions. As a preliminary exploration of the prognostic value of these features, we trained a supervised machine learning classifier to predict 5-year biochemical recurrence (BCR) outcomes, finding that 3D PNI-related features are moderately prognostic and outperform 2D PNI-related features (AUC = 0.71 vs. 0.52). Source code is available at this https URL.
254. 【2603.06932】HIERAMP: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
Link: https://arxiv.org/abs/2603.06932
Authors: Lin Zhao, Xinru Jiang, Xi Xiao, Qihui Fan, Lei Lu, Yanzhi Wang, Xue Lin, Octavia Camps, Pu Zhao, Jianyang Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: creating small surrogate, small surrogate datasets, creating small, small surrogate, original large-scale
Comments: The paper is accepted by CVPR 2026
Click to view abstract
Abstract:Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HIERAMP to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HIERAMP consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.
255. 【2603.06925】Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images
Link: https://arxiv.org/abs/2603.06925
Authors: Qianqian Zhang, Xiaolong Jia, Ahmed M. Abdelmoniem, Li Zhou, Junshe An
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: challenging high-precision detection, weakly textured, challenging high-precision, general algorithms, easily disturbed
Comments: The manuscript has been submitted to the journal and is currently under review
Click to view abstract
Abstract:Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, challenging high-precision detection with general algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+ as a lightweight visible-infrared fusion network. To enhance detection, ESM-YOLO+ includes two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, effectively aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision to preserve fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate ESM-YOLO+'s superiority. The model achieves 84.71% mAP on VEDAI and 74.0% mAP on DroneVehicle, while greatly reducing model complexity, with 93.6% fewer parameters and 68.0% lower GFLOPs than the baseline. These results confirm that ESM-YOLO+ integrates strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.
256. 【2603.06920】DLRMamba: Distilling Low-Rank Mamba for Edge Multispectral Fusion Object Detection
Link: https://arxiv.org/abs/2603.06920
Authors: Qianqian Zhang, Leon Tabaro, Ahmed M. Abdelmoniem, Junshe An
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multispectral fusion object, edge-based maritime surveillance, high inference efficiency, Multispectral fusion, fusion object detection
Comments: Has been submitted to the IEEE TGRS journal
Click to view abstract
Abstract:Multispectral fusion object detection is a critical task for edge-based maritime surveillance and remote sensing, demanding both high inference efficiency and robust feature representation for high-resolution inputs. However, current State Space Models (SSMs) like Mamba suffer from significant parameter redundancy in their standard 2D Selective Scan (SS2D) blocks, which hinders deployment on resource-constrained hardware and leads to the loss of fine-grained structural information during conventional compression. To address these challenges, we propose the Low-Rank Two-Dimensional Selective Structured State Space Model (Low-Rank SS2D), which reformulates state transitions via matrix factorization to exploit intrinsic feature sparsity. Furthermore, we introduce a Structure-Aware Distillation strategy that aligns the internal latent state dynamics of the student with a full-rank teacher model to compensate for potential representation degradation. This approach substantially reduces computational complexity and memory footprint while preserving the high-fidelity spatial modeling required for object recognition. Extensive experiments on five benchmark datasets and real-world edge platforms, such as Raspberry Pi 5, demonstrate that our method achieves a superior efficiency-accuracy trade-off, significantly outperforming existing lightweight architectures in practical deployment scenarios.
257. 【2603.06917】PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Link: https://arxiv.org/abs/2603.06917
Authors: Zhengjian Kang, Jun Zhuang, Kangtong Mo, Qi Chen, Rui Liu, Ye Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Detection Transformer, redefined object detection, set prediction task, prediction task, detection by casting
Comments: 10 pages, 6 figures
Click to view abstract
Abstract:Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localization-classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5%-4.2% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.
258. 【2603.06894】Learning From Design Procedure To Generate CAD Programs for Data Augmentation
Link: https://arxiv.org/abs/2603.06894
Authors: Yan-Ying Chen, Dule Shu, Matthew Hong, Andrew Taber, Jonathan Li, Matthew Klenk
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Large Language, demonstrated impressive capabilities, CAD, Language Models
Comments: Accepted by NeurIPS 2025 Workshop: Deep Learning for Code in the Agentic Era
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of code generation tasks. However, generating code for certain domains remains challenging. One such domain is Computer-Aided Design (CAD) program, where the goal is to produce scripted parametric models that define object geometry for precise design and manufacturing applications. A key challenge in LLM-based CAD program generation is the limited geometric complexity of generated shapes compared to those found in real-world industrial designs. This shortfall is in part due to the lack of diversity in the available CAD program training data. To address this, we propose a novel data augmentation paradigm that prompts an LLM to generate CAD programs conditioned on a reference surface program and a modeling procedure - an idea inspired by practices in industrial design. By varying the reference surface using a collection of organic shapes, our method enriches the geometric distribution of generated CAD models. In particular, it introduces edges and faces defined by spline-based curvature, which are typically missing or underrepresented in existing open-source CAD program datasets. Experiments show that our method produces CAD samples with significantly greater geometric diversity and a higher resemblance to industry-grade CAD designs in terms of the proportion of organic shape primitives. This enhancement makes our CAD data augmentation approach a useful tool for training LLMs and other deep learning models in CAD generation.
259. 【2603.06885】OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation
Link: https://arxiv.org/abs/2603.06885
Authors: Kibrom Gebremedhin, Hadush Hailu, Bruk Gebregziabher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Sub-Saharan Africa bearing, Sub-Saharan Africa, Africa bearing, burden and Ethiopia, Ethiopia alone accounting
Comments: 9 figures, 3 tables
Click to view abstract
Abstract:Trachoma remains the leading infectious cause of blindness worldwide, with Sub-Saharan Africa bearing over 85% of the global burden and Ethiopia alone accounting for more than half of all cases. Yet publicly available preprocessed datasets for automated trachoma classification are scarce, and none originate from the most affected region. Raw clinical photographs of eyelids contain significant background noise that hinders direct use in machine learning pipelines. We present OPTED, an open-source preprocessed trachoma eye dataset constructed using the Segment Anything Model 3 (SAM 3) for automated region-of-interest extraction. We describe a reproducible four-step pipeline: (1) text-prompt-based zero-shot segmentation of the tarsal conjunctiva using SAM 3, (2) background removal and bounding-box cropping with alignment, (3) quality filtering based on confidence scores, and (4) Lanczos resizing to 224x224 pixels. A separate prompt-selection stage identifies the optimal text prompt, and manual quality assurance verifies outputs. Through comparison of five candidate prompts on all 2,832 known-label images, we identify "inner surface of eyelid with red tissue" as optimal, achieving a mean confidence of 0.872 (std 0.070) and 99.5% detection rate (the remaining 13 images are recovered via fallback prompts). The pipeline produces outputs in two formats: cropped and aligned images preserving the original aspect ratio, and standardized 224x224 images ready for pre-trained architectures. The OPTED dataset, preprocessing code, and all experimental artifacts are released as open source to facilitate reproducible trachoma classification research.
260. 【2603.06873】PICS: Pairwise Image Compositing with Spatial Interactions
Link: https://arxiv.org/abs/2603.06873
Authors: Hang Zhou, Xinxin Zuo, Sen Wang, Li Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: strong single-turn performance, disrupt physical consistency, preserve coherent spatial, coherent spatial relations, overwrite previously generated
Comments: ICLR 2026. Project page: [this https URL](https://ryanhangzhou.github.io/pics/) , code: [this https URL](https://github.com/RyanHangZhou/PICS)
Click to view abstract
Abstract:Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive $\alpha$-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at this https URL
261. 【2603.06863】A prior information informed learning architecture for flying trajectory prediction
Link: https://arxiv.org/abs/2603.06863
Authors: Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: analytics to aerospace, flying objects, domains ranging, ranging from sports, sports analytics
Comments:
Click to view abstract
Abstract:Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks.
262. 【2603.06861】IGLU: The Integrated Gaussian Linear Unit Activation Function
Link: https://arxiv.org/abs/2603.06861
Authors: Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: deep neural networks, governing gradient flow, optimization stability, neural networks, representational capacity
Comments:
Click to view abstract
Abstract:Activation functions are fundamental to deep neural networks, governing gradient flow, optimization stability, and representational capacity. While ReLU has been the dominant activation function in historic deep architectures, modern transformer-based models are increasingly adopting smoother alternatives such as GELU and other self-gated functions. Despite their empirical success, the mathematical relationships among these functions and the principles underlying their effectiveness remain only partially understood. We introduce IGLU, a parametric activation function derived as a scale mixture of GELU gates under a half-normal mixing distribution. This derivation yields a closed-form expression whose gating component is exactly the Cauchy CDF, providing a principled one-parameter family that continuously interpolates between identity-like and ReLU-like behavior via a single sharpness parameter $\sigma$. Unlike GELU's Gaussian gate, IGLU's heavy-tailed Cauchy gate decays polynomially in the negative tail, guaranteeing non-zero gradients for all finite inputs and offering greater robustness to vanishing gradients. We further introduce IGLU-Approx, a computationally efficient rational approximation of IGLU expressed entirely in terms of ReLU operations that eliminates transcendental function evaluation. Through evaluations on CIFAR-10, CIFAR-100, and WikiText-103 across ResNet-20, ViT-Tiny, and GPT-2 Small, IGLU achieves competitive or superior performance on both vision and language datasets against ReLU and GELU baselines, with IGLU-Approx recovering this performance at substantially reduced computational cost. In particular, we show that employing a heavy-tailed gate leads to considerable performance gains in heavily imbalanced classification datasets.
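The abstract states only that the gate is the Cauchy CDF with a sharpness parameter $\sigma$. A minimal sketch under that reading follows. The exact form `x * (1/2 + arctan(x / sigma) / pi)`, the function names, and the default `sigma` are assumptions for illustration and may differ from the paper's parameterization.

```python
import math


def iglu(x, sigma=1.0):
    """Assumed IGLU form: input times a Cauchy-CDF gate with scale sigma."""
    gate = 0.5 + math.atan(x / sigma) / math.pi
    return x * gate


def iglu_grad(x, sigma=1.0):
    """Analytic derivative of the assumed form above."""
    gate = 0.5 + math.atan(x / sigma) / math.pi
    # d/dx of the Cauchy CDF is the Cauchy density with scale sigma.
    dgate = sigma / (math.pi * (sigma ** 2 + x ** 2))
    return gate + x * dgate


# Small sigma: the gate approaches a step function, so the output is ReLU-like.
print(iglu(2.0, sigma=0.01), iglu(-2.0, sigma=0.01))
# Large sigma: the gate stays near 1/2, so the output is close to linear (x / 2).
print(iglu(2.0, sigma=1e6))
# Far in the negative tail the gradient is small but still non-zero.
print(iglu_grad(-50.0))
```

Under this assumed form the negative-tail gradient shrinks roughly like $\sigma^3/|x|^3$, which matches the polynomial decay the abstract attributes to the heavy-tailed gate.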
263. 【2603.06860】ColonSplat: Reconstruction of Peristaltic Motion in Colonoscopy with Dynamic Gaussian Splatting
Link: https://arxiv.org/abs/2603.06860
Authors: Weronika Smolak-Dyżewska, Joanna Kaleta, Diego Dall'Alba, Przemysław Spurek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex peristaltic movements, advanced surgical navigation, colonoscopy data, accounting for complex, peristaltic movements
Comments:
Click to view abstract
Abstract:Accurate 3D reconstruction of colonoscopy data, accounting for complex peristaltic movements, is crucial for advanced surgical navigation and retrospective diagnostics. While recent novel view synthesis and 3D reconstruction methods have demonstrated remarkable success in general endoscopic scenarios, they struggle in the highly constrained environment of the colon. Due to the limited field of view of a camera moving through an actively deforming tubular structure, existing endoscopic methods reconstruct the colon appearance only for the initial camera trajectory. However, the underlying anatomy remains largely static; instead of updating Gaussians' spatial coordinates (xyz), these methods encode deformation through either rotation, scale or opacity adjustments. In this paper, we first present a benchmark analysis of state-of-the-art dynamic endoscopic methods for realistic colonoscopic scenes, showing that they fail to model true anatomical motion. To enable rigorous evaluation of global reconstruction quality, we introduce DynamicColon, a synthetic dataset with ground-truth point clouds at every timestep. Building on these insights, we propose ColonSplat, a dynamic Gaussian Splatting framework that captures peristaltic-like motion while preserving global geometric consistency, achieving superior geometric fidelity on C3VDv2 and DynamicColon datasets. Project page: this https URL
264. 【2603.06853】An Extended Topological Model For High-Contrast Optical Flow
Link: https://arxiv.org/abs/2603.06853
Authors: Brad Turow, Jose A. Perea
Categories: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
Keywords: dense core subsets, high-contrast optical flow, optical flow patches, optical flow torus, flow patches sampled
Comments: 28 pages, 31 figures
Click to view abstract
Abstract:In this paper, we identify low-dimensional models for dense core subsets in the space of $3\times 3$ high-contrast optical flow patches sampled from the Sintel dataset. In particular, we leverage the theory of approximate and discrete circle bundles to identify a 3-manifold whose boundary is a previously proposed optical flow torus, together with disjoint circles corresponding to pairs of binary step-edge range image patches. The 3-manifold model we introduce provides an explanation for why the previously-proposed torus model could not be verified with direct methods (e.g., a straightforward persistent homology computation). We also demonstrate that nearly all optical flow patches in the top 1 percent by contrast norm are found near the family of binary step-edge circles described above, rather than the optical flow torus, and that these frequently occurring patches are concentrated near motion boundaries (which are of particular importance for computer vision tasks such as object segmentation and tracking). Our findings offer insights on the subtle interplay between topology and geometry in inference for visual data.
265. 【2603.06852】Active View Selection with Perturbed Gaussian Ensemble for Tomographic Reconstruction
Link: https://arxiv.org/abs/2603.06852
Authors: Yulun Wu, Ruyi Zha, Wei Cao, Yingying Li, Yuanhao Cai, Yaoyao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Sparse-view computed tomography, reducing radiation exposure, X-ray Gaussian Splatting, active view selection, computed tomography
Comments:
Click to view abstract
Abstract:Sparse-view computed tomography (CT) is critical for reducing radiation exposure to patients. Recent advances in radiative 3D Gaussian Splatting (3DGS) have enabled fast and accurate sparse-view CT reconstruction. Despite these algorithmic advancements, practical reconstruction fidelity remains fundamentally bounded by the quality of the captured data, raising the crucial yet underexplored problem of X-ray active view selection. Existing active view selection methods are primarily designed for natural-light scenes and fail to capture the unique geometric ambiguities and physical attenuation properties inherent in X-ray imaging. In this paper, we present Perturbed Gaussian Ensemble, an active view selection framework that integrates uncertainty modeling with sequential decision-making, tailored for X-ray Gaussian Splatting. Specifically, we identify low-density Gaussian primitives that are likely to be uncertain and apply stochastic density scaling to construct an ensemble of plausible Gaussian density fields. For each candidate projection, we measure the structural variance of the ensemble predictions and select the one with the highest variance as the next best view. Extensive experimental results on arbitrary-trajectory CT benchmarks demonstrate that our density-guided perturbation strategy effectively eliminates geometric artifacts and consistently outperforms existing baselines in progressive tomographic reconstruction under unified view selection protocols.
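The selection rule described here (score each candidate projection by how much the ensemble members disagree, then pick the highest-scoring one) can be sketched in a few lines. The abstract does not define its "structural variance", so the sketch substitutes plain mean per-pixel variance as a stand-in. The function name, the data layout, and the toy projections are all hypothetical.

```python
from statistics import pvariance


def select_next_view(ensemble_predictions):
    """Pick the candidate view whose ensemble predictions disagree the most.

    ensemble_predictions maps a view id to a list of predicted projections,
    one per ensemble member, each flattened to a list of pixel values.
    """
    def score(projections):
        # Group the same pixel across ensemble members, then average the variances.
        per_pixel = [pvariance(values) for values in zip(*projections)]
        return sum(per_pixel) / len(per_pixel)

    scores = {view: score(preds) for view, preds in ensemble_predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-pixel projections from a three-member ensemble.
candidates = {
    "angle_000": [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],  # members agree
    "angle_045": [[1.0, 2.0], [1.5, 2.5], [0.5, 1.5]],  # mild disagreement
    "angle_090": [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]],  # strong disagreement
}

best, scores = select_next_view(candidates)
print(best, scores)
```

In a sequential setting this rule would be reapplied after each newly acquired view, with the ensemble rebuilt from the updated reconstruction.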
266. 【2603.06846】MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies
Link: https://arxiv.org/abs/2603.06846
Authors: Howard H. Qian, Kejia Ren, Yu Xiang, Vicente Ordonez, Kaiyu Hang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Rigid bodies constitute, smallest manipulable elements, Rigid bodies, moving rigid bodies, real world
Comments: 23 pages, 18 figures
Click to view abstract
Abstract:Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.
267. 【2603.06828】Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
Link: https://arxiv.org/abs/2603.06828
Authors: Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: maintain temporally grounded, temporally grounded beliefs, grounded beliefs generalize, maintain temporally, temporally grounded
Comments:
Click to view abstract
Abstract:We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).
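The headline statistic is a Pearson correlation with a permutation-test p-value over eight models. The sketch below shows how such a pair of numbers is typically computed. The two-sided test, the add-one p-value estimate, the function names, and the eight data points are assumptions: the values are synthetic toy numbers, not results from the paper.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation test: shuffle ys and count |r| at least as extreme."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic values for eight hypothetical models (NOT taken from the paper).
step_grounding_rate = [0.42, 0.48, 0.51, 0.55, 0.60, 0.63, 0.70, 0.74]
ood_retention = [0.50, 0.58, 0.54, 0.61, 0.69, 0.66, 0.78, 0.80]

r = pearson_r(step_grounding_rate, ood_retention)
p = permutation_p_value(step_grounding_rate, ood_retention)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

With only eight points there are 8! = 40,320 orderings, so an exact enumeration is also feasible. Random sampling is used here only to keep the sketch short.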
268. 【2603.06803】A Hybrid Machine Learning Model for Cerebral Palsy Detection
Link: https://arxiv.org/abs/2603.06803
Authors: Karan Kumar Singh,Nikita Gajbhiye,Gouri Sankar Mishra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Cerebral Palsy, treatments for Cerebral, proposed model, development of effective, effective treatments
Comments: 28 pages, 19 figures, 8 tables. This manuscript is based on the article published in the International Journal of Intelligent Systems and Applications in Engineering (IJISAE), 2024. The arXiv version is provided for open accessibility and wider dissemination
Abstract:The development of effective treatments for Cerebral Palsy (CP) can begin with the early identification of affected children while they are still in the early stages of the disorder. Pathological issues in the brain can be better diagnosed with the use of one of many medical imaging techniques. Magnetic Resonance Imaging (MRI) has revolutionized medical imaging with its unparalleled image resolution. A unique Machine Learning (ML) model that was built to identify CP disorder is presented in this paper. The model is intended to assist in the early diagnosis of CP in newborns. In this study, the brain MRI images dataset was first collected, and then the preprocessing techniques were applied to this dataset to make it ready for use in the proposed model. Following this, the proposed model was constructed by combining three CNN models, specifically VGG 19, Efficient-Net, and the ResNet50 model, to extract features from the image. Following this, a Bi-LSTM was utilized as a classifier to determine whether or not CP was present, and finally, the proposed model was employed for training and testing. The results show that the proposed model achieved an accuracy of 98.83%, which is higher than VGG-19 (96.79%), Efficient-Net (97.29%), and VGG-16 (97.50%). When the suggested model is compared to other models that have been pre-trained in the past, the accuracy scores seem to be much higher.
269. 【2603.06753】EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track
Link: https://arxiv.org/abs/2603.06753
Authors: Zhenyuan Chen,Guanyuan Shen,Feng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, multi-modal aerial-view analysis, comprehensive multi-modal aerial-view
Comments: tech report
Abstract:Cross-modal image-to-image translation among Electro-Optical (EO), Infrared (IR), and Synthetic Aperture Radar (SAR) sensors is essential for comprehensive multi-modal aerial-view analysis. However, translating between these modalities is notoriously difficult due to their distinct electromagnetic signatures and geometric characteristics. This paper presents EarthBridge, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge -- Translation (MAVIC-T). We explore two distinct methodologies: Diffusion Bridge Implicit Models (DBIM), which we generalize using non-Markovian bridge processes for high-quality deterministic sampling, and Contrastive Unpaired Translation (CUT), which utilizes contrastive learning for structural consistency. Our EarthBridge framework employs a channel-concatenated UNet denoiser trained with Karras-weighted bridge scalings and a specialized "booting noise" initialization to handle the inherent ambiguity in cross-modal mappings. We evaluate these methods across all four challenge tasks (SAR$\rightarrow$EO, SAR$\rightarrow$RGB, SAR$\rightarrow$IR, RGB$\rightarrow$IR), achieving superior spatial detail and spectral accuracy. Our solution achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard. Code is available at this https URL.
270. 【2603.06750】XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification
Link: https://arxiv.org/abs/2603.06750
Authors: Tapon Kumer Ray,Rajkumar Y,Shalini R,Srigayathri K,Jayashree S,Lokeswari P
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Plant disease classification, Convolutional Neural Network, light-weight Convolutional Neural, Plant disease, precision agriculture
Comments: 14 pages, 8 figures, Conference Paper
Abstract:Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel light-weight Convolutional Neural Network (CNN) that integrates self-attention and multi-modal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the model's focus on disease features. The model's compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.
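As a small illustration of what a vegetation index map is, the sketch below computes NDVI per pixel using the standard definition (NIR - Red) / (NIR + Red). The abstract does not say how the paper obtains its band values or computes its maps, so the inputs, the `eps` guard, and the function names here are assumptions.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are reflectance values; eps (an assumption) avoids
    division by zero on completely dark pixels.
    """
    return (nir - red) / (nir + red + eps)


def ndvi_map(nir_band, red_band):
    """Apply NDVI element-wise to two 2D bands of identical shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2x2 reflectance bands, for illustration only.
nir_band = [[0.8, 0.5], [0.6, 0.1]]
red_band = [[0.2, 0.5], [0.2, 0.1]]
index_map = ndvi_map(nir_band, red_band)
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so larger NDVI values generally indicate denser, healthier foliage.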
271. 【2603.06746】ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers
Link: https://arxiv.org/abs/2603.06746
Authors: Aryan Karmore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Deploying sparse Mixture, Vision Transformers remains, Deploying sparse, sparse Mixture, Transformers remains
Comments:
Abstract:Deploying sparse Mixture of Experts (MoE) Vision Transformers remains a challenge due to linear expert memory scaling. Linear memory scaling stores $N_E$ independent expert weight matrices requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds edge devices' memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory which is sub-linear in the number of experts. To address the unique challenges of vision, a spatial smoothness regulariser is introduced that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. Across image classification tasks on CIFAR-100, ButterflyViT achieves 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows multiple experts to fit on edge-constrained devices showing that geometric parameterization breaks linear scaling.
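To make the two scaling expressions concrete, the sketch below counts parameters for independent experts versus a shared prototype plus per-expert rotation parameters. All sizes are hypothetical, the meaning assigned to `n_l` is a guess at the abstract's $n_\ell$, and bit widths (such as ternary storage) are ignored, so the output is not expected to match the reported 354$\times$ figure.

```python
def independent_expert_params(n_experts, d_model, d_ff):
    """Each expert stores its own d_model x d_ff weight matrix."""
    return n_experts * d_model * d_ff


def shared_prototype_params(n_experts, d_model, d_ff, n_l, d):
    """One shared d_model x d_ff prototype plus n_l * d rotation
    parameters per expert (n_l and d are illustrative assumptions)."""
    return d_model * d_ff + n_experts * n_l * d


def compression_ratio(n_experts, d_model=384, d_ff=1536, n_l=9, d=384):
    """Parameter-count ratio under the hypothetical sizes above."""
    dense = independent_expert_params(n_experts, d_model, d_ff)
    shared = shared_prototype_params(n_experts, d_model, d_ff, n_l, d)
    return dense / shared


for n in (8, 16, 32, 64):
    print(n, round(compression_ratio(n), 1))
```

Because the shared prototype term is paid once, the ratio grows as experts are added, which is the qualitative behavior the abstract describes.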
272. 【2603.06741】Heterogeneous Decentralized Diffusion Models
Link: https://arxiv.org/abs/2603.06741
Authors: Zhiying Jiang,Raihan Seraj,Marcos Villagra,Bidhan Roy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: tightly coupled clusters, frontier-scale diffusion models, requires substantial computational, substantial computational resources, computational resources concentrated
Comments: Accepted to CVPR2026
Abstract:Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time via a deterministic schedule-aware conversion into a common velocity space without retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-alpha's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces compute from 1176 to 72 GPU-days (16x) and data from 158M to 11M (14x). Under aligned inference settings, our heterogeneous 2DDPM:6FM configuration achieves better FID (11.88 vs. 12.45) and higher intra-prompt diversity (LPIPS 0.631 vs. 0.617) than the homogeneous 8FM baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework lowers infrastructure requirements for decentralized generative model training.
273. 【2603.06735】Vessel-Aware Deep Learning for OCTA-Based Detection of AMD
Link: https://arxiv.org/abs/2603.06735
Authors: Margalit G. Mitzner,Moinak Bhattacharya,Zhilin Zou,Chao Chen,Prateek Prasanna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Age-related macular degeneration, coherence tomography angiography, early micro-vascular alterations, optical coherence tomography, exploit clinically meaningful
Comments:
Abstract:Age-related macular degeneration (AMD) is characterized by early micro-vascular alterations that can be captured non-invasively using optical coherence tomography angiography (OCTA), yet most deep learning (DL) models rely on global features and fail to exploit clinically meaningful vascular biomarkers. We introduce an external multiplicative attention framework that incorporates vessel-specific tortuosity maps and vasculature dropout maps derived from arteries, veins, and capillaries. These biomarker maps are generated from vessel segmentations and smoothed across multiple spatial scales to highlight coherent patterns of vascular remodeling and capillary rarefaction. Tortuosity reflects abnormalities in vessel geometry linked to impaired auto-regulation, while dropout maps capture localized perfusion deficits that precede structural retinal damage. The maps are fused with the OCTA projection to guide a deep classifier toward physiologically relevant regions. Arterial tortuosity provided the most consistent discriminative value, while capillary dropout maps performed best among density-based variants, especially at larger smoothing scales. Our proposed method offers interpretable insights aligned with known AMD pathophysiology.
274. 【2603.06732】HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Link: https://arxiv.org/abs/2603.06732
Authors: Tingting Han,Xinsong Tao,Yufei Yin,Min Tan,Sicheng Zhao,Zhou Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Temporal Sentence Grounding, natural language query, Temporal Sentence, temporally localize segments, Sentence Grounding
Comments:
Abstract:Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks--Charades-OV and ActivityNet-OV--that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.
275. 【2603.06723】UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms
Link: https://arxiv.org/abs/2603.06723
Authors: Xiang Ao,Yiling Du,Zidan Wang,Mengru Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: media and AIGC, image copyright protection, essential technology, widely deployed, rapid development
Comments: 26 pages, 7 figures
Abstract:Invisible watermarks, as an essential technology for image copyright protection, have been widely deployed with the rapid development of social media and AIGC. However, existing invisible watermark detection heavily relies on prior knowledge of specific algorithms, leading to limited detection capabilities for "unknown watermarks" in open environments. To this end, we propose a novel task named Universal Watermark Presence Detection (UWPD), which aims to identify whether an image carries a copyright mark without requiring decoding information. We construct the UniFreq-100K dataset, comprising large-scale samples across various invisible watermark embedding algorithms. Furthermore, we propose the Frequency Shield Network (FSNet). This model deploys an Adaptive Spectral Perception Module (ASPM) in the shallow layers, utilizing learnable frequency gating to dynamically amplify high-frequency watermark signals while suppressing low-frequency semantics. In the deep layers, the network introduces Dynamic Multi-Spectral Attention (DMSA) combined with tri-stream extremum pooling to deeply mine watermark energy anomalies, forcing the model to precisely focus on sensitive frequency bands. Extensive experiments demonstrate that FSNet exhibits superior zero-shot detection capabilities on the UWPD task, outperforming existing baseline models. Code and datasets will be released upon acceptance.
276. 【2603.06704】On the Generalization Capacities of MLLMs for Spatial Intelligence
Link: https://arxiv.org/abs/2603.06704
Authors: Gongjie Zhang,Wenhao Li,Quanhao Qian,Jiuniu Wang,Deli Zhao,Shijian Lu,Ran Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multimodal Large Language, Large Language Models, process RGB inputs, directly process RGB, Multimodal Large
Comments: ICLR 2026 (Oral)
Abstract:Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially-grounded tasks, indicating that camera-awareness is not only beneficial but also a prerequisite for robust and generalizable spatial intelligence in MLLMs.
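One common way to turn camera intrinsics into a dense, per-token signal is to back-project each patch center through a pinhole model into a unit viewing ray. The abstract does not specify the embedding used in the paper, so the sketch below is only an assumed illustration of the general idea; the function names and all numeric values are hypothetical.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit viewing ray through pixel (u, v) for a pinhole camera
    with focal lengths (fx, fy) and principal point (cx, cy)."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def token_ray_embedding(image_w, image_h, patch, fx, fy, cx, cy):
    """One ray per visual token, taken at each patch center,
    listed in row-major order."""
    rays = []
    for row in range(image_h // patch):
        for col in range(image_w // patch):
            u = (col + 0.5) * patch
            v = (row + 0.5) * patch
            rays.append(pixel_ray(u, v, fx, fy, cx, cy))
    return rays


# Hypothetical 64x32 image, 16-pixel patches, made-up intrinsics.
rays = token_ray_embedding(64, 32, 16, 50.0, 50.0, 32.0, 16.0)
```

The same image content paired with different intrinsics yields different rays, which is the kind of information an RGB-only model never sees.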
277. 【2603.06700】SIQA: Toward Reliable Scientific Image Quality Assessment
Link: https://arxiv.org/abs/2603.06700
Authors: Wenzhe Li,Liang Chen,Junying Wang,Yijing Guo,Ye Shen,Farong Wen,Chunyi Li,Zicheng Zhang,Guangtao Zhai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: depict visual scenes, encode structured domain, image quality assessment, images fundamentally differ, Scientific Image Quality
Comments:
Abstract:Scientific images fundamentally differ from natural and AI-generated images in that they encode structured domain knowledge rather than merely depict visual scenes. Assessing their quality therefore requires evaluating not only perceptual fidelity but also scientific correctness and logical completeness. However, existing image quality assessment (IQA) paradigms primarily focus on perceptual distortions or image-text alignment, implicitly assuming that depicted content is factually valid. This assumption breaks down in scientific contexts, where visually plausible figures may still contain conceptual errors or incomplete reasoning. To address this gap, we introduce Scientific Image Quality Assessment (SIQA), a framework that models scientific image quality along two complementary dimensions: Knowledge (Scientific Validity and Scientific Completeness) and Perception (Cognitive Clarity and Disciplinary Conformity). To operationalize this formulation, we design two evaluation protocols: SIQA-U (Understanding), which measures semantic comprehension of scientific content through multiple-choice tasks, and SIQA-S (Scoring), which evaluates alignment with expert quality judgments. We further construct the SIQA Challenge, consisting of an expert-annotated benchmark and a large-scale training set. Experiments across representative multimodal large language models (MLLMs) reveal a consistent discrepancy between scoring alignment and scientific understanding. While models can achieve strong agreement with expert ratings under SIQA-S, their performance on SIQA-U remains substantially lower. Fine-tuning improves both metrics, yet gains in scoring consistently outpace improvements in understanding. These results suggest that rating consistency alone may not reliably reflect scientific comprehension, underscoring the necessity of multidimensional evaluation for scientific image quality assessment.
278. 【2603.06699】Multi-label Instance-level Generalised Visual Grounding in Agriculture
Link: https://arxiv.org/abs/2603.06699
Authors: Mohammadreza Haghighat,Alzayat Saleh,Mostafa Rahimi Azghadi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Understanding field imagery, distinguishing individual crop, Understanding field, distinguishing individual, central challenge
Comments:
Abstract:Understanding field imagery such as detecting plants and distinguishing individual crop and weed instances is a central challenge in precision agriculture. Despite progress in vision-language tasks like captioning and visual question answering, Visual Grounding (VG), localising language-referred objects, remains unexplored in agriculture. A key reason is the lack of suitable benchmark datasets for evaluating grounding models in field conditions, where many plants look highly similar, appear at multiple scales, and the referred target may be absent from the image. To address these limitations, we introduce gRef-CW, the first dataset designed for generalised visual grounding in agriculture, including negative expressions. Benchmarking current state-of-the-art grounding models on gRef-CW reveals a substantial domain gap, highlighting their inability to ground instances of crops and weeds. Motivated by these findings, we introduce Weed-VG, a modular framework that incorporates multi-label hierarchical relevance scoring and interpolation-driven regression. Weed-VG advances instance-level visual grounding and provides a clear baseline for developing VG methods in precision agriculture. Code will be released upon acceptance.
279. 【2603.06698】Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer
Link: https://arxiv.org/abs/2603.06698
Authors: Kabir Thayani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe geometric constraints, learned representation space, induces severe geometric, global Vision Transformers, Knowledge distillation
Comments: 3 pages, 3 figures, 1 table
Abstract:Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling global Vision Transformers (CLIP and DINOv2) into capacity-constrained CNNs. By employing strictly centered SVD and Effective Rank, we first demonstrate a capacity-agnostic phase transition on CIFAR-10 where standard cosine distillation collapses representations to an intrinsic Effective Rank of ~16. To reverse this, we integrate an auxiliary contrastive objective (InfoNCE), expanding the student's manifold by 2.4x (to ~38 effective dimensions). We further demonstrate that while DINOv2's uniform geometry partially prevents collapse, contrastive expansion remains a universal requirement to reach the CNN's topological capacity limit (~82 dimensions). Finally, we reveal a critical capacity-density trade-off: overparameterization within fixed manifolds induces brittleness, while capacity-constrained models act as optimal low-pass semantic filters, successfully recovering inherent noise immunity.
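Effective Rank is often defined as the exponential of the Shannon entropy of the normalized singular-value spectrum. The abstract does not state which definition the paper uses, so the sketch below assumes this entropy-based form and takes the singular values (for example from an SVD of mean-centered features) as given input.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a singular-value spectrum.

    The singular values are normalized into a probability
    distribution; the result is exp(Shannon entropy) of it.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum over 16 directions has effective rank 16,
# while a spectrum dominated by one direction collapses toward 1.
flat = effective_rank([1.0] * 16)
collapsed = effective_rank([5.0, 0.0, 0.0])
```

Unlike the algebraic rank, this quantity varies smoothly, so it can register partial collapse where many singular values are small but nonzero.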
280. 【2603.06697】Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs
Link: https://arxiv.org/abs/2603.06697
Authors: Yiwei Li,Zihao Wu,Yanjun Lv,Hanqi Jiang,Weihang You,Zhengliang Liu,Dajiang Zhu,Xiang Li,Quanzheng Li,Tianming Liu,Lin Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: grounded radiology tasks, radiology tasks, Vision, language models, visually grounded radiology
Comments:
Abstract:Vision--language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.
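Predicting "gaze-selected image patch indices in temporal order" requires mapping raw fixations onto the patch grid. The sketch below shows one plausible way to build such targets; the fixation format `(time, x, y)`, the row-major indexing, and the function names are assumptions, not details taken from the paper.

```python
def gaze_to_patch_index(x, y, image_w, image_h, patch):
    """Row-major index of the patch containing pixel (x, y),
    clamped so that out-of-bounds gaze points stay on the grid."""
    cols = image_w // patch
    rows = image_h // patch
    col = min(max(int(x // patch), 0), cols - 1)
    row = min(max(int(y // patch), 0), rows - 1)
    return row * cols + col


def gaze_sequence_to_targets(fixations, image_w, image_h, patch):
    """Sort fixations (time, x, y) by time and map each one to
    the index of the patch it falls in."""
    ordered = sorted(fixations, key=lambda f: f[0])
    return [
        gaze_to_patch_index(x, y, image_w, image_h, patch)
        for _, x, y in ordered
    ]


# Hypothetical fixations on a 224x224 image with 16-pixel patches.
fixations = [(0.30, 223, 223), (0.10, 0, 0), (0.20, 20, 40)]
targets = gaze_sequence_to_targets(fixations, 224, 224, 16)
```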
281. 【2603.06696】HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training
Link: https://arxiv.org/abs/2603.06696
Authors: Hwihun Jeong,Qiang Liu,Kathryn E. Keenan,Elisabeth A. Wilde,Walter Schneider,Sudhir Pathak,Anthony Zuccolotto,Lauren J. O'Donnell,Lipeng Ning,Yogesh Rathi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: confounds subsequent analysis, multi-site diffusion MRI, Combining multi-site diffusion, diffusion MRI, Combining multi-site
Comments:
Abstract:Purpose: Combining multi-site diffusion MRI (dMRI) data is hindered by inter-scanner variability, which confounds subsequent analysis. Previous harmonization methods require large, matched or traveling human subjects from multiple sites, which are impractical to acquire in many situations. This study aims to develop a deep learning-based dMRI harmonization framework that eliminates the reliance on multi-site in-vivo traveling human data for training. Methods: HARP employs a voxel-wise 1D neural network trained on an easily transportable diffusion phantom. The model learns relationships between spherical harmonics coefficients of different sites without memorizing spatial structures. Results: HARP reduced inter-scanner variability levels significantly in various measures. Quantitatively, it decreased inter-scanner variability as measured by standard error in FA (12%), MD (10%), and GFA (30%) with scan-rescan standard error as the baseline, while preserving fiber orientations and tractography after harmonization. Conclusion: We believe that HARP represents an important first step toward dMRI harmonization using only phantom data, thereby obviating the need for complex, matched in vivo multi-site cohorts. This phantom-only strategy substantially enhances the feasibility and scalability of quantitative dMRI for large-scale clinical studies.
282. 【2603.06693】Soft Equivariance Regularization for Invariant Self-Supervised Learning
Link: https://arxiv.org/abs/2603.06693
Authors: Joohyung Lee,Changhun Kim,Hyunsu Kim,Kwanhyung Lee,Juho Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Self-supervised learning, semantic-preserving augmentations, invariant to semantic-preserving, typically learns representations, SER
Comments: 14th International Conference on Learning Representations (ICLR 2026)
Abstract:Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER learns/predicts no per-sample transformation codes/labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.
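The idea of penalizing the mismatch between the features of a transformed input and the analytically transformed features of the original input can be shown with a toy example. The sketch below assumes a horizontal flip as the group action and a single-channel 2D token map; the actual groups, layers, and loss weighting used by SER are not given in the abstract, and the two toy encoders are hypothetical.

```python
def hflip(grid):
    """Horizontal flip of a 2D token map (list of rows)."""
    return [list(reversed(row)) for row in grid]


def equivariance_penalty(features_of_flipped_input, features_of_input):
    """Mean squared gap between f(g . x) and rho_g(f(x)),
    with rho_g taken to be the same flip applied in feature space."""
    target = hflip(features_of_input)
    squared = [
        (a - b) ** 2
        for row_a, row_b in zip(features_of_flipped_input, target)
        for a, b in zip(row_a, row_b)
    ]
    return sum(squared) / len(squared)


def pointwise_encoder(grid):
    """Toy encoder that is exactly flip-equivariant."""
    return [[2.0 * v for v in row] for row in grid]


def position_biased_encoder(grid):
    """Toy encoder that leaks the column position, breaking equivariance."""
    return [[v + col for col, v in enumerate(row)] for row in grid]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
good = equivariance_penalty(pointwise_encoder(hflip(x)), pointwise_encoder(x))
bad = equivariance_penalty(
    position_biased_encoder(hflip(x)), position_biased_encoder(x)
)
```

The penalty is zero for the equivariant encoder and positive for the position-biased one, so adding it to a training loss would softly push an intermediate layer toward equivariance.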
283. 【2603.06691】One-Shot Badminton Shuttle Detection for Mobile Robots
Link: https://arxiv.org/abs/2603.06691
Authors: Florentin Dipner,William Talbot,Turcan Tuna,Andrei Cramariuc,Marco Hutter
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: robust one-shot badminton, one-shot badminton shuttlecock, robust one-shot, one-shot badminton, framework for non-stationary
Comments: Under review for IEEE R-AP
Abstract:This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline that enables efficient labeling from stationary camera footage. We propose a metric suited to our downstream use case and fine-tune a YOLOv8 network optimized for real-time shuttlecock detection, achieving an F1-score of 0.86 under our metric in test environments similar to training, and 0.70 in entirely unseen environments. Our analysis reveals that detection performance is critically dependent on shuttlecock size and background texture complexity. Qualitative experiments confirm their applicability to robots with moving cameras. Unlike prior work with stationary camera setups, our detector is specifically designed for the egocentric, dynamic viewpoints of mobile robots, providing a foundational building block for downstream tasks, including tracking, trajectory estimation, and system (re)-initialization.
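For reference, the F1-score combines precision and recall from detection counts. The sketch below uses the standard formula; how the paper's own metric decides whether a detection counts as a true positive is not described in the abstract, and the counts shown are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen so that precision and recall are both 0.86.
score = f1_score(tp=86, fp=14, fn=14)
```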
284. 【2603.06690】Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind
Link: https://arxiv.org/abs/2603.06690
Authors: Julia Anna Leonardi,Johannes Jakubik,Paolo Fraccaro,Maria Antonia Brovelli
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Geospatial Foundation Models, Geospatial Foundation, Hyperspectral Imaging, typically lack native, typically lack
Comments: Accepted to ICLR 2026 Machine Learning for Remote Sensing (ML4RS) Workshop
Abstract:Geospatial Foundation Models (GFMs) typically lack native support for Hyperspectral Imaging (HSI) due to the complexity and sheer size of high-dimensional spectral data. This study investigates the adaptability of TerraMind, a multimodal GFM, to address HSI downstream tasks without HSI-specific pretraining. To this end, we implement and compare two channel adaptation strategies: Naive Band Selection and physics-aware Spectral Response Function (SRF) grouping. Overall, our results indicate a general superiority of deep learning models with native support of HSI data. Our experiments also demonstrate the ability of TerraMind to adapt to HSI downstream tasks through band selection with moderate performance decline. Therefore, the findings of this research establish a critical baseline for HSI integration, motivating the need for native spectral tokenization in future multimodal model architectures.
285. 【2603.06689】High-Resolution Image Reconstruction with Unsupervised Learning and Noisy Data Applied to Ion-Beam Dynamics for Particle Accelerators
Link: https://arxiv.org/abs/2603.06689
Authors: Francis Osswald(IPHC),Mohammed Chahbaoui(UNISTRA),Xinyi Liang(SU)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: challenging inverse problem, high-energy physics accelerators, severe degradation remains, inverse problem, physics accelerators
Comments:
Abstract:Image reconstruction in the presence of severe degradation remains a challenging inverse problem, particularly in beam diagnostics for high-energy physics accelerators. As modern facilities demand precise detection of beam halo structures to control losses, traditional analysis tools have reached their performance limits. This work reviews existing image-processing techniques for data cleaning, contour extraction, and emittance reconstruction, and introduces a novel approach based on convolutional filtering and neural networks with optimized early-stopping strategies in order to control overfitting. Despite the absence of training datasets, the proposed unsupervised framework achieves robust denoising and high-fidelity reconstruction of beam emittance images under low signal-to-noise conditions. The method extends measurable amplitudes beyond seven standard deviations, enabling unprecedented halo resolution.
286. 【2603.06688】Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Link: https://arxiv.org/abs/2603.06688
Authors: Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: consistent visual content, Narrative Weaver, framework that addresses, addresses a fundamental, fundamental challenge
Abstract:We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.
287. 【2603.06687】TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings
Link: https://arxiv.org/abs/2603.06687
Authors: Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multimedia (cs.MM); Robotics (cs.RO)
Keywords: traffic planning, embodied navigation, world modeling, infer location, underpins applications
Comments: 66 Pages. In Review
Abstract:Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: this https URL.
288. 【2603.06684】Three-dimensional reconstruction and segmentation of an aggregate stockpile for size and shape analyses
Link: https://arxiv.org/abs/2603.06684
Authors: Erol Tutumluer, Haohang Huang, Jiayi Luo, Issam Qamhia, John M. Hart
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: transportation geotechnics applications, geotechnics applications, key properties, properties for determining, road construction
Comments: 7 pages, 4 figures, Proceedings of the 20th International Conference on Soil Mechanics and Geotechnical Engineering
Abstract:Aggregate size and shape are key properties for determining quality of aggregate materials used in road construction and transportation geotechnics applications. The composition and packing, layer stiffness, and load response are all influenced by these morphological characteristics of aggregates. Many aggregate imaging systems developed to date only focus on analyses of individual or manually separated aggregate particles. There is a need to develop a convenient and affordable system for acquiring 3D aggregate information from stockpiles in the field. This paper presents an innovative 3D imaging approach for potential field evaluation of large-sized aggregates, whereby engineers can perform inspection by taking videos/images with mobile devices such as smartphone cameras. The approach leverages Structure-from-Motion (SfM) techniques to reconstruct the stockpile surface as 3D spatial data, i.e. point cloud, and uses a 3D segmentation algorithm to separate and extract individual aggregates from the reconstructed stockpile. The preliminary results presented in this paper demonstrate the future potential of using 3D aggregate size and shape information for onsite Quality Assurance/Quality Control (QA/QC) tasks.
289. 【2603.06683】ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction
Link: https://arxiv.org/abs/2603.06683
Authors: Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimedia Event Extraction, involves extracting structured, Large Language Model, extracting structured event, structured event records
Abstract:Multimedia Event Extraction (M2E2) involves extracting structured event records from both textual and visual content. Existing approaches, ranging from specialized architectures to direct Large Language Model (LLM) prompting, typically rely on linear, end-to-end generation and thus suffer from cascading errors: early cross-modal misalignments often corrupt downstream role assignment under strict grounding constraints. We propose ECHO (Event-Centric Hypergraph Operations), a multi-agent framework that iteratively refines a shared Multimedia Event Hypergraph (MEHG), which serves as an explicit intermediate structure for multimodal event hypotheses. Unlike dialogue-centric frameworks, ECHO coordinates specialized agents by applying atomic hypergraph operations to the MEHG. Furthermore, we introduce a Link-then-Bind strategy that enforces deferred commitment: agents first identify relevant arguments and only then determine their precise roles, mitigating incorrect grounding and limiting error propagation. Extensive experiments on the M2E2 benchmark show that ECHO significantly outperforms the state-of-the-art (SOTA): with Qwen3-32B, it achieves a 7.3% and 15.5% improvement in average event mention and argument role F1, respectively.
290. 【2603.06681】RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review
Link: https://arxiv.org/abs/2603.06681
Authors: Zhaoyi Sun, Minal Jagtiani, Wen-wai Yim, Fei Xia, Martin Gunn, Meliha Yetisgen, Asma Ben Abacha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: meaningful discrepancies arising, reporting variability, interpretation differences, arising from interpretation, clinically meaningful discrepancies
Abstract:Radiology reports for the same patient examination may contain clinically meaningful discrepancies arising from interpretation differences, reporting variability, or evolving assessments. Systematic analysis of such discrepancies is important for quality assurance, clinical decision support, and multimodal model development, yet remains limited by the lack of standardized benchmarks. We present RADAR, a multimodal benchmark for radiology report discrepancy analysis that pairs 3D medical images with a preliminary report and corresponding candidate edits for the same study. The dataset reflects a standard clinical workflow in which trainee radiologists author preliminary reports that are subsequently reviewed and revised by attending radiologists. RADAR defines a structured discrepancy assessment task requiring models to evaluate proposed edits by determining image-level agreement, assessing clinical severity, and classifying edit type (correction, addition, or clarification). In contrast to prior work emphasizing binary error detection or comparison against fully independent reference reports, RADAR targets fine-grained clinical reasoning and image-text alignment at the report review stage. The benchmark consists of expert-annotated abdominal CT examinations and is accompanied by standardized evaluation protocols to support systematic comparison of multimodal models. RADAR provides a clinically grounded testbed for evaluating multimodal systems as reviewers of radiology report edits.
291. 【2603.06680】VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images
Link: https://arxiv.org/abs/2603.06680
Authors: Neil Tripathi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: reliably answer, human viewer, viewer cannot reliably, output VISIBLY, VISIBLY
Comments: 18 pages, 1 figure, 3 tables. Code and data: [this https URL](https://github.com/neilt93/Paper-with-Davis)
Abstract:We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.
292. 【2603.06679】MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines
Link: https://arxiv.org/abs/2603.06679
Authors: Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Video world models, shown immense promise, players hold influence, Video world, simulation and entertainment
Comments: Project page here: [this https URL](https://ryanpo.com/multigen/)
Abstract:Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the system, a persistent state operating independent of the model's context window, that is continually updated by user actions and queried throughout the generation roll-out. Unlike conventional diffusion game engines that operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via an editable memory representation, and it naturally extends to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.
293. 【2603.06677】Chart Deep Research in LVLMs via Parallel Relative Policy Optimization
Link: https://arxiv.org/abs/2603.06677
Authors: Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: simple numerical presentation, numerical presentation tools, deep research capabilities, deep research, research capabilities
Comments: Accepted at ICLR 2026
Abstract:With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the "error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.
294. 【2603.06676】XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis
Link: https://arxiv.org/abs/2603.06676
Authors: Diana Susan Joseph, Pranav M Pawar, Raja Muthalagu, Mithun Mukharjee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Explainable Artificial Intelligence, Performing a timely, maintain agricultural productivity, integrates Explainable Artificial, food security
Comments: 27 pages, 8 figures
Abstract:Timely and accurate identification of crop diseases is vital to maintaining agricultural productivity and food security. The current work presents a hybrid few-shot learning model that integrates Explainable Artificial Intelligence (XAI) and Few-Shot Learning (FSL) to identify and classify disease stages in maize, rice, and wheat leaves under limited annotated data conditions. The proposed model integrates Siamese and Prototypical Networks within an episodic training paradigm to effectively learn discriminative disease features from a few examples. To ensure model transparency and trustworthiness, Gradient-weighted Class Activation Mapping (Grad-CAM) is employed for visualizing key decision regions in the leaf images, offering interpretable insights into the classification process. Experimental evaluations on custom few-shot datasets developed in the study show that the model consistently achieves high accuracy, precision, recall, and F1-scores, frequently exceeding 92% across various disease stages. Comparative analyses against baseline FSL models further confirm the superior performance and explainability of the proposed approach. The framework offers a promising solution for real-world, data-constrained agricultural disease monitoring applications.
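To make the few-shot step concrete, here is a minimal sketch of the generic Prototypical Network inference rule: average the support embeddings of each class into a prototype, then assign a query to the nearest prototype. It is not the authors' hybrid Siamese/Prototypical model; the function names, the two stage labels, and the 2-D embeddings are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def class_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for label, embedding in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings for two disease stages (illustrative values only).
support = [
    ("early", [0.0, 0.0]), ("early", [0.2, 0.0]),
    ("late", [1.0, 1.0]), ("late", [1.0, 0.8]),
]
prototypes = class_prototypes(support)
print(classify([0.1, 0.1], prototypes))
```

In a real system the embeddings would come from a trained encoder; only the prototype averaging and nearest-prototype lookup are shown here.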
295. 【2603.06674】AutoFigure-Edit: Generating Editable Scientific Illustration
Link: https://arxiv.org/abs/2603.06674
Authors: Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: automated systems remain, systems remain limited, existing automated systems, communicating complex scientific, High-quality scientific illustrations
Abstract:High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations. To facilitate further progress in this field, we release the video at this https URL, full codebase at this https URL and provide a website for easy access and interactive use at this https URL.
296. 【2603.06673】Unmixing microinfrared spectroscopic images of cross-sections of historical oil paintings
Link: https://arxiv.org/abs/2603.06673
Authors: Shivam Pande, Nicolas Nadisic, Francisco Mederos-Henry, Aleksandra Pizurica
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: spatially resolved characterisation, Spectroscopic imaging, enables non-invasive, spatially resolved, central to heritage
Comments: 5 pages
Abstract:Spectroscopic imaging (SI) has become central to heritage science because it enables non-invasive, spatially resolved characterisation of materials in artefacts. In particular, attenuated total reflection Fourier transform infrared microscopy (ATR-$\mu$FTIR) is widely used to analyse painting cross-sections, where a spectrum is recorded at each pixel to form a hyperspectral image (HSI). Interpreting these data is difficult: spectra are often mixtures of several species in heterogeneous, multi-layered and degraded samples, and current practice still relies heavily on manual comparison with reference libraries. This workflow is slow, subjective and hard to scale. We propose an unsupervised CNN autoencoder for blind unmixing of ATR-$\mu$FTIR HSIs, estimating endmember spectra and their abundance maps while exploiting local spatial structure through patch-based modelling. To reduce sensitivity to atmospheric and acquisition artefacts across $1500$ bands, we introduce a weighted spectral angle distance (WSAD) loss with automatic band-reliability weights derived from robust measures of spatial flatness, neighbour agreement and spectral roughness. Compared with standard SAD training, WSAD improves interpretability in contamination-prone spectral regions. We demonstrate the method on an ATR-$\mu$FTIR cross-section from the Ghent Altarpiece attributed to the Van Eyck brothers.
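The idea of a band-weighted spectral angle can be sketched as follows. This is one plausible form, assuming the weights enter the inner product and both norms; the abstract does not give the exact WSAD definition or how the reliability weights are computed, so `weighted_sad` and the hand-set weights below are hypothetical.

```python
import math


def weighted_sad(x, y, weights):
    """Spectral angle (radians) between x and y under per-band weights.

    Assumes at least one band with non-zero weight and non-zero signal.
    """
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    cosine = dot / (norm_x * norm_y)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))


# Hypothetical 4-band spectra; the last band is corrupted in the reconstruction.
target = [1.0, 2.0, 3.0, 4.0]
recon = [1.0, 2.0, 3.0, 0.0]
uniform = [1.0, 1.0, 1.0, 1.0]
downweighted = [1.0, 1.0, 1.0, 0.0]  # assumed reliability weights

print(weighted_sad(target, recon, uniform))       # plain SAD
print(weighted_sad(target, recon, downweighted))  # corrupted band ignored
```

With uniform weights this reduces to the standard spectral angle; setting a band's weight to zero removes its influence on the loss.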
297. 【2603.06672】Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study
Link: https://arxiv.org/abs/2603.06672
Authors: Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: image diffusion models, Semantic noise initialization, reported to improve, improve robustness, robustness and controllability
Comments: 8 pages, 1 figure. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction Beyond
Abstract:Semantic noise initialization has been reported to improve robustness and controllability in image diffusion models. Whether these gains transfer to text-to-video (T2V) generation remains unclear, since temporal coupling can introduce extra degrees of freedom and instability. We benchmark semantic noise initialization against standard Gaussian noise using a frozen VideoCrafter-style T2V diffusion backbone and VBench on 100 prompts. Using prompt-level paired tests with bootstrap confidence intervals and a sign-flip permutation test, we observe a small positive trend on temporal-related dimensions; however, the 95 percent confidence interval includes zero (p ~ 0.17) and the overall score remains on par with the baseline. To understand this outcome, we analyze the induced perturbations in noise space and find patterns consistent with weak or unstable signal. We recommend prompt-level paired evaluation and noise-space diagnostics as standard practice when studying initialization schemes for T2V diffusion.
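The prompt-level paired evaluation recommended above can be sketched with a percentile bootstrap confidence interval and a sign-flip permutation test on per-prompt score differences. This is a generic version of both procedures, not the authors' code; the function names, resample counts, and the ten difference values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def sign_flip_p_value(diffs, n_permutations=2000, seed=0):
    """Two-sided p-value: how often random sign flips match the observed mean."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-prompt score differences (semantic init minus Gaussian init).
diffs = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.04, -0.03, 0.01, 0.00]
print(bootstrap_ci(diffs))
print(sign_flip_p_value(diffs))
```

If the interval contains zero and the p-value is large, the paired data do not support a difference between the two initialization schemes, which is the kind of outcome the abstract reports.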
298. 【2603.06670】CalibFusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments
Link: https://arxiv.org/abs/2603.06670
Authors: Yuting Wan, Liguo Sun, Jiuwu Hao, Pin LV
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: residual misalignment biases, Camera fusion improves, fusion improves perception, Camera extrinsic calibration, degrades cross-modal aggregation
Abstract:Millimeter-wave (mmWave) Radar-Camera fusion improves perception under adverse illumination and weather, but its performance is sensitive to Radar-Camera extrinsic calibration: residual misalignment biases Radar-to-image projection and degrades cross-modal aggregation for downstream 2D detection. Existing calibration and auto-calibration methods are mainly developed for road and urban scenes with abundant structures and object constraints, whereas water-surface environments feature large textureless regions, sparse and intermittent targets, and wave-/specular-induced Radar clutter, which weakens explicit object-centric matching. We propose CalibFusion, a calibration-conditioned Radar-Camera fusion detector that learns implicit extrinsic refinement end-to-end with the detection objective. CalibFusion builds a multi-frame persistence-aware Radar density representation with intensity weighting and Doppler-guided suppression of fast-varying clutter. A cross-modal transformer interaction module predicts a confidence-gated refinement of the initial extrinsics, which is integrated through a differentiable projection-and-splatting operator to generate calibration-conditioned image-plane Radar features. Experiments on WaterScenes and FLOW show improved fusion-based 2D detection and robustness under synthetic miscalibration, supported by sensitivity analyses and qualitative Radar-to-image overlays. Results on nuScenes indicate that the refinement mechanism transfers beyond water-surface scenarios.
299. 【2603.06666】SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation
Link: https://arxiv.org/abs/2603.06666
Authors: Zhehao Yu, Baoquan Zhang, Bingqi Shan, Xinhao Liu, Dongliang Zhou, Guotao Liang, Guangming Ye, Yunming Ye
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remarkable generative capability, recently demonstrated remarkable, demonstrated remarkable generative, sequential nature results, significant inference latency
Abstract:Autoregressive (AR) image models have recently demonstrated remarkable generative capability, but their sequential nature results in significant inference latency. Existing training-free acceleration methods typically verify tokens independently, overlooking the strong co-occurrence patterns between adjacent visual tokens. This independence assumption often leads to contextual inconsistency and limits decoding efficiency. In this work, we introduce a novel training-free acceleration framework that performs phrase-level speculative verification, enabling the model to jointly validate multiple correlated tokens within each decoding window. To construct such phrase units, we analyze token co-occurrence statistics from the training corpus and group frequently co-occurring tokens into semantically coherent visual phrases. During inference, the proposed phrase-level verification evaluates aggregated likelihood ratios over each phrase, allowing simultaneous acceptance of multiple tokens while preserving generation quality. Extensive experiments on autoregressive text-to-image generation show that our method significantly reduces the number of function evaluations (NFE) and achieves up to 30% faster decoding without compromising visual fidelity. Our findings reveal that modeling short-range token co-occurrence provides an effective and general principle for accelerating autoregressive inference.
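A rough sketch of the difference between token-wise and phrase-level verification follows. It assumes the phrase rule is the usual speculative-sampling acceptance test applied to the product of per-token likelihood ratios; the abstract does not spell out the exact criterion, so both functions and all probabilities are hypothetical, and the sketch makes no claim about preserving the target distribution.

```python
import math


def accept_phrase(target_probs, draft_probs, u):
    """Accept every token of a phrase jointly.

    The phrase passes when the product of per-token likelihood ratios
    p_target / p_draft, capped at 1, is at least the uniform draw u.
    """
    log_ratio = sum(
        math.log(p) - math.log(q) for p, q in zip(target_probs, draft_probs)
    )
    return u <= min(1.0, math.exp(log_ratio))


def accept_tokenwise(target_probs, draft_probs, draws):
    """Baseline: verify tokens one by one, stop at the first rejection."""
    accepted = 0
    for p, q, u in zip(target_probs, draft_probs, draws):
        if u > min(1.0, p / q):
            break
        accepted += 1
    return accepted


# Hypothetical probabilities for a 3-token phrase (illustrative values only).
target = [0.30, 0.60, 0.50]
draft = [0.50, 0.40, 0.30]
print(accept_tokenwise(target, draft, draws=[0.7, 0.7, 0.7]))
print(accept_phrase(target, draft, u=0.7))
```

In this toy case the first token alone has a ratio of 0.6 and fails the independent check, while the aggregated ratio over the phrase exceeds 1, so the whole phrase is accepted in one step.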
300. 【2603.06665】Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Link: https://arxiv.org/abs/2603.06665
Authors: Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large vision-language models, tasks remains underexplored, vision-language tasks remains, Large vision-language, medical vision-language tasks
Abstract:Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at this https URL.
301. 【2603.06664】Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index
Link: https://arxiv.org/abs/2603.06664
Authors: Chao Yuan, Pan Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Diffusion Transformer, full spatiotemporal attention, models inherently suffer, generation models inherently, long video synthesis
Abstract:Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence-parallel inference and implement a sequence-parallel variant of the causal rotary position embedding, which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence-parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments conducted on an eight-GPU A800 cluster show that the optimized system achieves comparable generation quality, sub-second first-frame latency, and near real-time inference speed. For generating five-second 480P videos, a 1.58x speedup is achieved, thereby providing effective support for real-time interactive applications.
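The role of a global index in sequence-parallel rotary embeddings can be illustrated in one dimension: each rank computes rotary angles for its own chunk from global token positions, so no positional information has to be exchanged between ranks. This is a simplified sketch, not the paper's 3D Causal-RoPE SP; the function names, the even chunking, and the toy sizes are assumptions.

```python
import math


def rope_angles(positions, dim=4, base=10000.0):
    """Rotary angles position * base**(-2i/dim) for each frequency pair i."""
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[pos * f for f in freqs] for pos in positions]


def local_angles(rank, world_size, seq_len, dim=4):
    """Angles for the contiguous chunk owned by `rank`, using global indices."""
    chunk = seq_len // world_size  # assumes seq_len divides evenly
    start = rank * chunk           # global offset of this rank's first token
    return rope_angles(range(start, start + chunk), dim)


seq_len, world_size = 8, 4
full = rope_angles(range(seq_len))
sharded = [
    row
    for rank in range(world_size)
    for row in local_angles(rank, world_size, seq_len)
]
# The per-rank results, concatenated in rank order, match the single-device table.
print(sharded == full)
```

Because each rank only needs its own offset, the angle table can also be precomputed per rank ahead of time, in the spirit of the RoPE precomputation the abstract mentions.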
302. 【2603.06663】Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting
Link: https://arxiv.org/abs/2603.06663
Authors: Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: multimodal language models, Recent advances, language models, training-free visual prompting, advances in training-free
Comments: AAAI-26 (Main Track)
Abstract:Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization by up to 11 percentage points.
303. 【2603.06662】HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding
Link: https://arxiv.org/abs/2603.06662
Authors: Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: storing task-specific prompts, LLMs is hindered, hindered by interference, prohibitive cost, cost of storing
Abstract:Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA-VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
304. 【2603.06661】EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis
Link: https://arxiv.org/abs/2603.06661
Authors: Bikram De, Habib Irani, Vangelis Metsis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: robust deep learning, training robust deep, crucial technique, robust deep, deep learning models
Comments:
Abstract:Data augmentation is a crucial technique for training robust deep learning models for human motion, where annotated datasets are often scarce. However, generic augmentation methods often ignore the underlying geometric and kinematic constraints of the human body, risking the generation of unrealistic motion patterns that can degrade model performance. Furthermore, the conventional approach of training a single generalist model on a dataset expanded with a mixture of all available transformations does not fully exploit the unique learning signals provided by each distinct augmentation type. We challenge this convention by introducing a novel training paradigm, EnsAug, that strategically uses augmentation to foster model diversity within an ensemble. Our method involves training an ensemble of specialists, where each model learns from the original dataset augmented by only a single, distinct geometric transformation. Experiments on sign language and human activity recognition benchmarks demonstrate that our diversified ensemble methodology significantly outperforms the standard practice of training one model on a combined augmented dataset and achieves state-of-the-art accuracy on two sign language and one human activity recognition dataset while offering greater modularity and efficiency. Our primary contribution is the empirical validation of this training strategy, establishing an effective baseline for leveraging data augmentation in skeletal motion analysis.
305. 【2603.06658】ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging
Link: https://arxiv.org/abs/2603.06658
Authors: Linfeng Ye, Shayan Mohajer Hamidi, Zhixiang Chi, Guang Li, Mert Pilanci, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: aggregate instance-level features, attention-based MIL methods, attention-based MIL, slide image, bag-level predictions
Comments: 39 pages, 26 figures
Abstract:Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distribution. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at this https URL.
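The over-concentration point can be illustrated with a small sketch. It assumes that "normalized sigmoid" means applying a sigmoid to each attention score and dividing by the sum; the abstract does not give the exact formula, and the scores below are made up for illustration. This is not the authors' code.

```python
import math

def softmax(scores):
    """Standard softmax attention weights (numerically stabilized)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_sigmoid(scores):
    """Assumed form: element-wise sigmoid, rescaled so the weights sum to 1."""
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [s / total for s in sig]

# Hypothetical raw attention scores for four patches in one bag.
scores = [8.0, 2.0, 1.0, 0.5]
soft_w = softmax(scores)
nsig_w = normalized_sigmoid(scores)

print("softmax           :", [round(w, 3) for w in soft_w])
print("normalized sigmoid:", [round(w, 3) for w in nsig_w])
```

Because the sigmoid saturates at 1, one large score cannot dominate the bag the way it does under the exponential in softmax, which is one plausible reading of why the swap reduces over-concentration.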
306. 【2603.06656】GameVerse: Can Vision-Language Models Learn from Video-based Reflection?
Link: https://arxiv.org/abs/2603.06656
Authors: Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: visually grounded interaction, Human gameplay, grounded interaction loop, players act, refine strategies
Comments:
Abstract:Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials, a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT). Our project page is available at this https URL. Our code is available at this https URL.
307. 【2603.06655】A Parameter-efficient Convolutional Approach for Weed Detection in Multispectral Aerial Imagery
Link: https://arxiv.org/abs/2603.06655
Authors: Leo Thomas Ramos, Angel D. Sappa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Feature Correction Block, efficient model designed, weed segmentation, designed for weed, proposed Feature Correction
Comments: 10 pages, 6 figures, 9 tables
Abstract:We introduce FCBNet, an efficient model designed for weed segmentation. The architecture is based on a fully frozen ConvNeXt backbone, the proposed Feature Correction Block (FCB), which leverages efficient convolutions for feature refinement, and a lightweight decoder. FCBNet is evaluated on the WeedBananaCOD and WeedMap datasets under both RGB and multispectral modalities, showing that FCBNet outperforms models such as U-Net, DeepLabV3+, SK-U-Net, SegFormer, and WeedSense in terms of mIoU, exceeding 85%, while also achieving superior computational efficiency, requiring only 0.06 to 0.2 hours for training. Furthermore, the frozen backbone strategy reduces the number of trainable parameters by more than 90%, significantly lowering memory requirements.
308. 【2603.06652】PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
Link: https://arxiv.org/abs/2603.06652
Authors: Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, designs emphasise final-answer, emphasise final-answer correctness, Language Models
Comments:
Abstract:Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.
309. 【2603.06650】Margin-Consistent Deep Subtyping of Invasive Lung Adenocarcinoma via Perturbation Fidelity in Whole-Slide Image Analysis
Link: https://arxiv.org/abs/2603.06650
Authors: Meghdad Sabouri Rad, Junze (Vincent) Huang, Mohammad Mehdi Hosseini, Rakesh Choudhary, Saverio J. Carello, Ola El-Zammar, Michel R. Nasr, Bardia Rodd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Whole-slide image classification, subtyping remains vulnerable, undermine model reliability, lung adenocarcinoma subtyping, adenocarcinoma subtyping remains
Comments: This document is the author's accepted manuscript (author version). The final published version is available online in the Journal of Imaging Informatics in Medicine at DOI: [https://doi.org/10.1007/s10278-026-01875-6](https://doi.org/10.1007/s10278-026-01875-6)
Abstract:Whole-slide image classification for invasive lung adenocarcinoma subtyping remains vulnerable to real-world imaging perturbations that undermine model reliability at the decision boundary. We propose a margin consistency framework evaluated on 203,226 patches from 143 whole-slide images spanning five adenocarcinoma subtypes in the BMIRDS-LUAD dataset. By combining attention-weighted patch aggregation with margin-aware training, our approach achieves robust feature-logit space alignment measured by Kendall correlations of 0.88 during training and 0.64 during validation. Contrastive regularization, while effective at improving class separation, tends to over-cluster features and suppress fine-grained morphological variation; to counteract this, we introduce Perturbation Fidelity (PF) scoring, which imposes structured perturbations through Bayesian-optimized parameters. Vision Transformer-Large achieves 95.20 +/- 4.65% accuracy, representing a 40% error reduction from the 92.00 +/- 5.36% baseline, while ResNet101 with an attention mechanism reaches 95.89 +/- 5.37% from 91.73 +/- 9.23%, a 50% error reduction. All five subtypes exceed an area under the receiver operating characteristic curve (AUC) of 0.99. On the WSSS4LUAD external benchmark, ResNet50 with an attention mechanism attains 80.1% accuracy, demonstrating cross-institutional generalizability despite approximately 15-20% domain-shift-related degradation and identifying opportunities for future adaptation research.
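The "error reduction" figures in this abstract are relative reductions in the error rate (100% minus accuracy), not differences in accuracy. A minimal check of that arithmetic, using only the mean accuracies quoted above (the function name is ours):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error rate that was removed.

    Both accuracies are given in percent, e.g. 92.00 for 92%.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# Mean accuracies as quoted in the abstract.
vit_reduction = relative_error_reduction(92.00, 95.20)     # ViT-Large
resnet_reduction = relative_error_reduction(91.73, 95.89)  # ResNet101 + attention

print(f"ViT-Large : {vit_reduction:.1%}")
print(f"ResNet101 : {resnet_reduction:.1%}")
```

The error rate drops from 8.00% to 4.80% for ViT-Large and from 8.27% to 4.11% for ResNet101, which matches the quoted 40% and roughly 50% reductions. The reported standard deviations are not taken into account here.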
310. 【2603.06648】ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments
Link: https://arxiv.org/abs/2603.06648
Authors: Shiyi Ding, Shaoen Wu, Ying Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large language models, multimodal large language, natural language-based scene, Recent advances, scene change queries
Comments: European Chapter of the Association for Computational Linguistics (EACL) 2026 Main
Abstract:Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.
311. 【2603.06640】Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models
Link: https://arxiv.org/abs/2603.06640
Authors: Ci Zhang, Zhaojun Ding, Chence Yang, Jun Liu, Xiaoming Zhai, Shaoyi Huang, Beiwen Li, Xiaolong Ma, Jin Lu, Geng Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remove undesired concepts, recently emerged, data-independent approach, approach to remove, remove undesired
Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts. To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining. Extensive experiments on diffusion-based unlearning based on concept related weights lead to the conclusion: once the critical concept-related weights in diffusion models are identified, our method can effectively recover the original concept regardless of how the weights are manipulated. Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.
312. 【2603.06639】RECAP: Local Hebbian Prototype Learning as a Self-Organizing Readout for Reservoir Dynamics
Link: https://arxiv.org/abs/2603.06639
Authors: Heng Zhang
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Keywords: reinforce recurring structure, high-dimensional population activity, local plasticity mechanisms, recurring structure, local plasticity
Comments: 20 pages, 6 figures
Abstract:Robust perception in brains is often attributed to high-dimensional population activity together with local plasticity mechanisms that reinforce recurring structure. In contrast, most modern image recognition systems are trained by error backpropagation and end-to-end gradient optimization, which are not naturally aligned with local computation and local plasticity. We introduce RECAP (Reservoir Computing with Hebbian Co-Activation Prototypes), a bio-inspired learning strategy for robust image classification that couples untrained reservoir dynamics with a self-organizing Hebbian prototype readout. RECAP discretizes time-averaged reservoir responses into activation levels, constructs a co-activation mask over reservoir unit pairs, and incrementally updates class-wise prototype matrices via a Hebbian-like potentiation-decay rule. Inference is performed by overlap-based prototype matching. The method avoids error backpropagation and is naturally compatible with online prototype updates. We illustrate the resulting robustness behavior on MNIST-C, where RECAP remains robust under diverse corruptions without exposure to corrupted training samples.
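The readout pipeline described above (discretize, build a co-activation mask, apply a Hebbian-like potentiation-decay update, match by overlap) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: two activation levels instead of several, the threshold, the learning and decay rates, and the toy "reservoir responses". The reservoir itself is not simulated, and this is not the paper's implementation.

```python
def coactivation_mask(response, threshold=0.5):
    """Binarize time-averaged unit responses and mark co-active unit pairs."""
    active = [1 if r > threshold else 0 for r in response]
    n = len(active)
    return [[active[i] * active[j] for j in range(n)] for i in range(n)]

def hebbian_update(prototype, mask, lr=0.2, decay=0.05):
    """Potentiation-decay rule: decay every entry, strengthen co-active pairs."""
    n = len(mask)
    return [[(1.0 - decay) * prototype[i][j] + lr * mask[i][j]
             for j in range(n)] for i in range(n)]

def overlap(prototype, mask):
    """Overlap score between a class prototype and a sample's mask."""
    return sum(p * m for p_row, m_row in zip(prototype, mask)
               for p, m in zip(p_row, m_row))

def classify(prototypes, response):
    mask = coactivation_mask(response)
    return max(prototypes, key=lambda label: overlap(prototypes[label], mask))

# Toy time-averaged responses of a 4-unit "reservoir" (hypothetical data).
training = {
    "A": [[0.9, 0.8, 0.1, 0.2], [0.7, 0.9, 0.2, 0.1]],
    "B": [[0.1, 0.2, 0.9, 0.8], [0.2, 0.1, 0.8, 0.7]],
}
prototypes = {label: [[0.0] * 4 for _ in range(4)] for label in training}
for label, samples in training.items():
    for sample in samples:  # online, one sample at a time, no backpropagation
        prototypes[label] = hebbian_update(prototypes[label],
                                           coactivation_mask(sample))

pred_a = classify(prototypes, [0.8, 0.6, 0.3, 0.1])
pred_b = classify(prototypes, [0.0, 0.4, 0.9, 0.6])
print(pred_a, pred_b)
```

Each update touches only one class prototype and uses only locally available co-activation, which is what makes this kind of readout compatible with online updates.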
313. 【2603.06614】Correlation Analysis of Generative Models
Link: https://arxiv.org/abs/2603.06614
Authors: Zhengguo Li, Chaobing Zheng, Wei Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: simple linear equations, Based on literature, existing diffusion models, literature review, neural network
Comments:
Abstract:Based on literature review about existing diffusion models and flow matching with a neural network to predict a predefined target from noisy data, a unified representation is first proposed for these models using two simple linear equations in this paper. Theoretical analysis of the proposed model is then presented. Our theoretical analysis shows that the correlation between the noisy data and the predicted target is sometimes weak in the existing diffusion models and flow matching. This might affect the prediction (or learning) process which plays a crucial role in all models.
314. 【2603.06613】OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence
Link: https://arxiv.org/abs/2603.06613
Authors: Stamatis Mastromichalakis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Keywords: selects update rules, paper presents OptiRoulette, Tiny ImageNet, paper presents, stochastic meta-optimizer
Comments: 23 pages, 10 figures, 7 tables
Abstract:This paper presents OptiRoulette, a stochastic meta-optimizer that selects update rules during training instead of fixing a single optimizer. The method combines warmup optimizer locking, random sampling from an active optimizer pool, compatibility-aware learning-rate scaling during optimizer transitions, and failure-aware pool replacement. OptiRoulette is implemented as a drop-in, "this http URL-compatible" component and packaged for pip installation. We report completed 10-seed results on five image-classification suites: CIFAR-100, CIFAR-100-C, SVHN, Tiny ImageNet, and Caltech-256. Against a single-optimizer AdamW baseline, OptiRoulette improves mean test accuracy from 0.6734 to 0.7656 on CIFAR-100 (+9.22 percentage points), 0.2904 to 0.3355 on CIFAR-100-C (+4.52), 0.9667 to 0.9756 on SVHN (+0.89), 0.5669 to 0.6642 on Tiny ImageNet (+9.73), and 0.5946 to 0.6920 on Caltech-256 (+9.74). Its main advantage is convergence reliability at higher targets: it reaches CIFAR-100/CIFAR-100-C 0.75, SVHN 0.96, Tiny ImageNet 0.65, and Caltech-256 0.62 validation accuracy in 10/10 runs, while the AdamW baseline reaches none of these targets within budget. On shared targets, OptiRoulette also reduces time-to-target (e.g., Caltech-256 at 0.59: 25.7 vs 77.0 epochs). Paired-seed deltas are positive on all datasets; CIFAR-100-C test ROC-AUC is the only metric not statistically significant in the current 10-seed study.
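Two of the ingredients named in the abstract, warmup optimizer locking and random sampling from an optimizer pool, can be shown on a toy problem. The sketch below is not the OptiRoulette package or its API. The pool contents, the quadratic objective, and all hyperparameters are assumptions, and the compatibility-aware learning-rate scaling and failure-aware pool replacement are left out.

```python
import random

def sgd_step(x, grad, lr):
    return [xi - lr * gi for xi, gi in zip(x, grad)]

def sign_sgd_step(x, grad, lr):
    sign = lambda g: (g > 0) - (g < 0)
    return [xi - lr * sign(gi) for xi, gi in zip(x, grad)]

def half_lr_sgd_step(x, grad, lr):
    return [xi - 0.5 * lr * gi for xi, gi in zip(x, grad)]

# Hypothetical pool of update rules; the paper's pool is not given in the abstract.
POOL = {"sgd": sgd_step, "sign_sgd": sign_sgd_step, "sgd_half_lr": half_lr_sgd_step}

def loss(x):
    return sum(xi * xi for xi in x)

def grad(x):
    return [2.0 * xi for xi in x]

def roulette_minimize(x0, steps=200, warmup=20, lr=0.05, seed=0):
    """Lock one rule during warmup, then sample a rule from the pool each step."""
    rng = random.Random(seed)
    x = list(x0)
    history = []
    for step in range(steps):
        name = "sgd" if step < warmup else rng.choice(sorted(POOL))
        x = POOL[name](x, grad(x), lr)
        history.append(name)
    return x, history

x_final, history = roulette_minimize([3.0, -2.0])
print("final loss:", loss(x_final))
print("rules used after warmup:", sorted(set(history[20:])))
```

The rules in this toy pool are stateless, so switching between them is trivial. With stateful optimizers such as Adam or momentum SGD, a transition has to decide what to do with the optimizer state, which is presumably where the paper's transition handling comes in.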
315. 【2603.06611】A Novel Approach for Testing Water Safety Using Deep Learning Inference of Microscopic Images of Unincubated Water Samples
Link: https://arxiv.org/abs/2603.06611
Authors: Sanjay Srinivasan
Categories: Other Computer Science (cs.OH); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: Target Product Profile, Fecal-contaminated water, Fecal-contaminated, UNICEF ideal Target, ideal Target Product
Comments:
Abstract:Fecal-contaminated water causes diseases and even death. Current microbial water safety tests require pathogen incubation, taking 24-72 hours and costing $20-$50 per test. This paper presents a solution (DeepScope) exceeding UNICEF's ideal Target Product Profile requirements for presence/absence testing, with an estimated per-test cost of $0.44. By eliminating the need for pathogen incubation, DeepScope reduces testing time by over 98%. In DeepScope, a dataset of microscope images of bacteria and water samples was assembled. An innovative augmentation technique, generating up to 21 trillion images from a single microscope image, was developed. Four convolutional neural network models were developed using transfer learning and regularization techniques, then evaluated on a field-test dataset comprising 100,000 microscope images of unseen, real-world water samples collected from fourteen different water sources across Sammamish, WA. Precision-recall analysis showed the DeepScope model achieves 93% accuracy, with precision of 90% and recall exceeding 94%. The DeepScope model was deployed on a web server, and mobile applications for Android and iOS were developed, enabling Internet-based or smartphone-based water safety testing, with results obtained in seconds.
316. 【2603.05530】ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
Link: https://arxiv.org/abs/2603.05530
Authors: Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, Lihua Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: accurately perceive complex, perceive complex visual, complex visual environments, navigation instructions, instructions and histories
Comments: Accepted by CVPR 2026
Abstract:Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-k high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.
317. 【2507.11202】A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition
Link: https://arxiv.org/abs/2507.11202
Authors: Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Multimodal Emotion Recognition, Emotion Recognition, privacy protection requirements, practical applications due, encounters incomplete multimodality
Comments:
Abstract:Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality's representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
318. 【2603.08385】Rectified flow-based prediction of post-treatment brain MRI from pre-radiotherapy priors for patients with glioma
Link: https://arxiv.org/abs/2603.08385
Authors: Selena Huisman, Nordin Belkacemi, Vera Keil, Joost Verhoeff, Szabolcs David
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: years of lost, life on average, MRI, lost life, Purpose
Comments: 10 pages, 6 figures, 1 supplementary table
Abstract:Purpose/Objective: Brain tumors result in 20 years of lost life on average. Standard therapies induce complex structural changes in the brain that are monitored through MRI. Recent developments in artificial intelligence (AI) enable conditional multimodal image generation from clinical data. In this study, we investigate AI-driven generation of follow-up MRI in patients with intracranial tumors through conditional image generation. This approach enables realistic modeling of post-radiotherapy changes, allowing for treatment optimization. Material/Methods: The public SAILOR dataset of 25 patients was used to create a 2D rectified flow model conditioned on axial slices of pre-treatment MRI and RT dose maps. Cross-attention conditioning was used to incorporate temporal and chemotherapy data. The resulting images were validated with structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Dice scores and Jacobian determinants. Results: The resulting model generates realistic follow-up MRI for any time point, while integrating treatment information. Comparing real versus predicted images, SSIM is 0.88, and PSNR is 22.82. Tissue segmentations from real versus predicted MRI result in a mean Dice-Sørensen coefficient (DSC) of 0.91. The rectified flow (RF) model enables up to 250x faster inference than Denoising Diffusion Probabilistic Models (DDPM). Conclusion: The proposed model generates realistic follow-up MRI in real-time, preserving both semantic and visual fidelity as confirmed by image quality metrics and tissue segmentations. Conditional generation allows counterfactual simulations by varying treatment parameters, producing predicted morphological changes. This capability has potential to support adaptive treatment dose planning and personalized outcome prediction for patients with intracranial tumors.
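For readers unfamiliar with two of the metrics quoted above, the standard definitions of PSNR and the Dice-Sørensen coefficient fit in a few lines. The sketch works on flattened toy arrays with made-up values. The intensity range (`max_value`) is an assumption, and the abstract does not say how the authors normalized their images.

```python
import math

def psnr(real, predicted, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def dice(mask_a, mask_b):
    """Dice-Sorensen coefficient of two binary masks: 2|A and B| / (|A| + |B|)."""
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy 4-pixel "images" scaled to [0, 1] and toy binary tissue masks.
real_img = [0.0, 0.5, 1.0, 0.5]
pred_img = [0.1, 0.4, 0.9, 0.6]
psnr_db = psnr(real_img, pred_img)

real_mask = [1, 1, 1, 0]
pred_mask = [1, 1, 0, 0]
dsc = dice(real_mask, pred_mask)

print(f"PSNR: {psnr_db:.2f} dB, Dice: {dsc:.2f}")
```

PSNR depends on the chosen intensity range, so a value such as 22.82 is only comparable across papers when the normalization matches.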
319. 【2603.07369】Task learning increases information redundancy of neural responses in macaque visual cortex
Link: https://arxiv.org/abs/2603.07369
Authors: Shizhao Liu, Anton Pletenev, Ralf M. Haefner, Adam C. Snyder
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: optimize sensory information, brain optimize sensory, information, Abstract, brain optimize
Comments: published in Science, accepted manuscript prior to editing, main text: 33 pages, 5 figures, 39 supplementary pages, 22 supplementary figures, 7 supplementary tables
Abstract:How does the brain optimize sensory information for decision-making in new tasks? One hypothesis suggests learning reduces redundancy in neural representations to improve efficiency, while another, based on Bayesian inference, predicts learning increases redundancy by distributing information across neurons. We tested these hypotheses by tracking population responses in macaque cortical area V4 as monkeys learned visual discrimination tasks. We found strong support for the Bayesian predictions: task learning increased redundancy in neural responses over weeks of training and within single trials. This redundancy did not reduce information but instead increased the information carried by individual neurons. These insights suggest sensory processing in the brain reflects a generative rather than discriminative inference process.
320. 【2603.06766】HiDE: Hierarchical Dictionary-Based Entropy Modeling for Learned Image Compression
Link: https://arxiv.org/abs/2603.06766
Authors: Haoxuan Xiong, Yuanyuan Xu, Kun Zhu, Yiming Wang, Baoliu Ye
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: remarkable coding efficiency, achieved remarkable coding, coding efficiency, Learned image compression, achieved remarkable
Comments:
Abstract:Learned image compression (LIC) has achieved remarkable coding efficiency, where entropy modeling plays a pivotal role in minimizing bitrate through informative priors. Existing methods predominantly exploit internal contexts within the input image, yet the rich external priors embedded in large-scale training data remain largely underutilized. Recent advances in dictionary-based entropy models have demonstrated that incorporating external priors can substantially enhance compression performance. However, current approaches organize heterogeneous external priors within a single-level dictionary, resulting in imbalanced utilization and limited representational capacity. Moreover, effective entropy modeling requires not only expressive priors but also a parameter estimation network capable of interpreting them. To address these challenges, we propose HiDE, a Hierarchical Dictionary-based Entropy modeling framework for learned image compression. HiDE decomposes external priors into global structural and local detail dictionaries with cascaded retrieval, enabling structured and efficient utilization of external information. Moreover, a context-aware parameter estimator with parallel multi-receptive-field design is introduced to adaptively exploit heterogeneous contexts for accurate conditional probability estimation. Experimental results show that HiDE achieves 18.5%, 21.99%, and 24.01% BD-rate savings over VTM-12.1 on the Kodak, CLIC, and Tecnick datasets, respectively.
321. 【2603.06712】Uncertainty-Aware Solar Flare Regression
Link: https://arxiv.org/abs/2603.06712
Authors: Jinsu Hong, Chetraj Pandey, Berkay Aydin
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: frequent false alarms, lack precise quantification, Current solar flare, Current solar, resulting in frequent
Comments:
Abstract:Current solar flare predictions often lack precise quantification of their reliability, resulting in frequent false alarms, particularly when dealing with datasets skewed towards extreme events. To improve the trustworthiness of space weather forecasting, it is crucial to establish confidence intervals for model predictions. Conformal prediction, a machine learning framework, presents a promising avenue for this purpose by constructing prediction intervals that ensure valid coverage in finite samples without making assumptions about the underlying data distribution. In this study, we explore the application of conformal prediction to regression tasks in space weather forecasting. Specifically, we implement full-disk solar flare prediction using images created from magnetic field maps and adapt four pre-trained deep learning models to incorporate three distinct methods for constructing confidence intervals: conformal prediction, quantile regression, and conformalized quantile regression. Our experiments demonstrate that conformalized quantile regression achieves higher coverage rates and more favorable average interval lengths compared to alternative methods, underscoring its effectiveness in enhancing the reliability of solar weather forecasting models.
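Conformalized quantile regression (CQR), the best-performing method in this abstract, takes a quantile model's lower and upper predictions and widens or narrows them by a margin computed on held-out calibration data. The sketch below follows the standard split-conformal recipe. The calibration values are invented, and in practice the lower and upper quantile predictions would come from a trained model rather than constants.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def cqr_interval(lo_pred, hi_pred, cal_lo, cal_hi, cal_y, alpha):
    """Adjust a predicted quantile interval using calibration conformity scores."""
    # Score is positive when the true value falls outside [lo, hi], negative inside.
    scores = [max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y)]
    q = conformal_quantile(scores, alpha)
    return lo_pred - q, hi_pred + q

# Hypothetical calibration set: quantile model always predicts [0, 1].
cal_lo = [0.0] * 9
cal_hi = [1.0] * 9
cal_y = [0.5, 1.2, -0.3, 0.9, 1.5, 0.1, 1.1, -0.1, 0.7]

# New prediction [2.0, 3.0], target coverage 80% (alpha = 0.2).
lower, upper = cqr_interval(2.0, 3.0, cal_lo, cal_hi, cal_y, alpha=0.2)
print(f"calibrated interval: [{lower:.2f}, {upper:.2f}]")
```

The coverage guarantee holds on average over exchangeable calibration and test data and requires no distributional assumptions, which is the property the abstract highlights. It is a marginal guarantee, not a per-sample one.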
322. 【2502.18775】Subclass Classification of Gliomas Using MRI Fusion Technique
Link: https://arxiv.org/abs/2502.18775
Authors: Kiranmayee Janardhan, Christy Bobby Thomas
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: exhibits diverse aggressiveness, diverse aggressiveness levels, prevalent primary brain, MRI images, glioma subclass classification
Comments: 15 pages, 7 figures, 1 algorithm, 4 tables, journal paper
Abstract:Glioma, the prevalent primary brain tumor, exhibits diverse aggressiveness levels and prognoses. Precise classification of glioma is paramount for treatment planning and predicting prognosis. This study aims to develop an algorithm to fuse the MRI images from T1, T2, T1ce, and fluid-attenuated inversion recovery (FLAIR) sequences to enhance the efficacy of glioma subclass classification as no tumor, necrotic core, peritumoral edema, and enhancing tumor. The MRI images from BraTS datasets were used in this work. The images were pre-processed using max-min normalization to ensure consistency in pixel intensity values across different images. The segmentation of the necrotic core, peritumoral edema, and enhancing tumor was performed on 2D and 3D images separately using UNET architecture. Further, the segmented regions from multimodal MRI images were fused using the weighted averaging technique. Integrating 2D and 3D segmented outputs enhances classification accuracy by capturing detailed features like tumor shape, boundaries, and intensity distribution in slices, while also providing a comprehensive view of spatial extent, shape, texture, and localization within the brain volume. The fused images were used as input to the pre-trained ResNet50 model for glioma subclass classification. The network is trained on 80% and validated on 20% of the data. The proposed method achieved a classification accuracy of 99.25%, precision of 99.30%, recall of 99.10%, F1 score of 99.19%, Intersection over Union of 84.49%, and specificity of 99.76%, which showed a significantly higher performance than existing techniques. These findings emphasize the significance of glioma segmentation and classification in aiding accurate diagnosis.
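The two pre-processing and fusion steps named in the abstract, max-min normalization and weighted averaging, are simple enough to show directly. The sketch applies them to tiny 2x2 "slices" with invented intensities. The weights are hypothetical, since the abstract does not state which weights the authors used, and the segmentation step is skipped.

```python
def min_max_normalize(image):
    """Rescale pixel intensities to [0, 1]; a constant image maps to all zeros."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]

def weighted_fusion(images, weights):
    """Pixel-wise weighted average of same-sized images."""
    total = sum(weights)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(w * img[r][c] for img, w in zip(images, weights)) / total
             for c in range(cols)] for r in range(rows)]

# Toy slices standing in for two MRI sequences (made-up intensities).
t1_slice = [[0, 50], [100, 200]]
flair_slice = [[10, 10], [30, 50]]

t1_norm = min_max_normalize(t1_slice)
flair_norm = min_max_normalize(flair_slice)

# Hypothetical fusion weights.
fused = weighted_fusion([t1_norm, flair_norm], weights=[0.6, 0.4])
print(fused)
```

Normalizing each sequence first keeps one modality with a larger raw intensity range from dominating the weighted average.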



