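This post presents the latest papers fetched daily from the arXiv website, grouped into categories such as natural language processing, information retrieval, and computer vision.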

Statistics

1,479 papers were updated today, including:

  • Natural language processing: 160
  • Information retrieval: 40
  • Computer vision: 425

Natural Language Processing

1. 【2603.15619】Mixture-of-Depths Attention

Link: https://arxiv.org/abs/2603.15619

Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, key driver, driver for large, large language, MoDA

Comments: Code is released at [this https URL](https://github.com/hustvl/MoDA)

Abstract:Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
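To make the idea of attending to both sequence and depth KV pairs more concrete, here is a minimal single-head, single-query sketch. It assumes that the two sets of KV pairs are simply concatenated and normalized by one joint softmax; the abstract does not specify this, and the function names and data layout are hypothetical rather than taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_attention(query, seq_kv, depth_kv):
    """Attend jointly over sequence KV pairs and depth KV pairs.

    seq_kv:   (key, value) pairs from the current layer.
    depth_kv: (key, value) pairs from preceding layers (assumed layout).
    The joint softmax over both sets is an assumption of this sketch.
    """
    pairs = list(seq_kv) + list(depth_kv)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key, _ in pairs])
    dim = len(pairs[0][1])
    return [sum(w * value[i] for w, (_, value) in zip(weights, pairs))
            for i in range(dim)]


# A shallow-layer feature whose key matches the query dominates the output.
seq_kv = [([0.0, 0.0], [0.0, 0.0])]
depth_kv = [([10.0, 0.0], [5.0, 5.0])]
print(moda_attention([10.0, 0.0], seq_kv, depth_kv))
```

In this toy setup the depth pair gives a deeper layer a direct path back to a feature formed earlier, which is the recovery problem the abstract describes.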

2. 【2603.15615】Mechanistic Origin of Moral Indifference in Language Models

Link: https://arxiv.org/abs/2603.15615

Authors: Lingyu Li, Yan Teng, Yingchun Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Existing behavioral alignment, internal unaligned representations, Existing behavioral

Comments: 24 pages, 11 figures, 5 tables

Abstract:Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.

3. 【2603.15611】Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Link: https://arxiv.org/abs/2603.15611

Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement learning, test pass rates, unit test pass, Test LLM, code generation relies

Comments: Project Page: [this https URL](https://zju-real.github.io/Code-A1) Code: [this https URL](https://github.com/ZJU-REAL/Code-A1)

Abstract:Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

4. 【2603.15600】From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Link: https://arxiv.org/abs/2603.15600

Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate process supervision, long-horizon robotic manipulation, process supervision remains, Accurate process, robotic manipulation

Comments: 31 pages

Abstract:Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

5. 【2603.15594】OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Link: https://arxiv.org/abs/2603.15594

Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, frontier Large Language, Deep search capabilities, Large Language, agents remains dominated

Comments: 15 pages, 6 figures

Abstract:Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby prompting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

6. 【2603.15547】Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Link: https://arxiv.org/abs/2603.15547

Authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Modeling plausible student, plausible student misconceptions, requires modeling incorrect, student misconceptions, Modeling plausible

Comments: none

Abstract:Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.

7. 【2603.15523】SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Link: https://arxiv.org/abs/2603.15523

Authors: David Števaňák, Marek Šuppa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Slovak Central Register, low-resource languages remains, languages remains understudied, morphologically rich, remains understudied

Comments: LREC 2026

Abstract:Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact--partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents (κ = 0.61) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (this https URL) and evaluation code (this https URL) are publicly available.
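As a rough illustration of why exact matching penalizes inflected forms, the sketch below computes an exact-match F1@k for a single document. The normalization (lowercasing and trimming only), the function name, and the example phrases are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def f1_at_k(predicted, gold, k=6):
    """Exact-match F1 between the top-k predicted keyphrases and the gold set.

    Normalization here is only lowercasing and whitespace trimming, so an
    inflected surface form does not match its canonical form.
    """
    top_k = [p.strip().lower() for p in predicted[:k]]
    gold_set = {g.strip().lower() for g in gold}
    if not top_k or not gold_set:
        return 0.0
    matches = len(set(top_k) & gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(top_k)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) example: the second prediction is an inflected form
# of a gold keyphrase, so exact matching gives it no credit.
predicted = ["neurónové siete", "strojového učenia", "klasifikácia"]
gold = ["neurónové siete", "strojové učenie"]
print(f1_at_k(predicted, gold))
```

Only one of three predictions matches exactly here, even though two of them refer to gold concepts; a partial or lemma-based matcher would score this document higher.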

8. 【2603.15518】Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models

Link: https://arxiv.org/abs/2603.15518

Authors: Xiyu Liu, Qingyi Si, Zhengxiao Liu, Chenxu Yang, Naibin Gu, Zheng Lin

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, original edited form, failure mode emerges, Large Language, editing efficiently updates

Comments: 23 pages, 20 figures

Abstract:While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model's geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.

9. 【2603.15513】ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

Link: https://arxiv.org/abs/2603.15513

Authors: Duy Vu Minh Nguyen, Chinh Thanh Truong, Phuc Hoang Tran, Hung Tuan Le, Nguyen Van-Thanh Dat, Trung Hieu Pham, Kiet Van Nguyen

Categories: Computation and Language (cs.CL)

Keywords: intelligent technologies aimed, Vietnamese medical research, increasingly vital domain, increasingly vital, rise of intelligent

Comments: none

Abstract:Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.

10. 【2603.15423】Invisible failures in human-AI interactions

Link: https://arxiv.org/abs/2603.15423

Authors: Christopher Potts, Moritz Sudhof

Categories: Computation and Language (cs.CL)

Keywords: systems fail silently, fail visibly, fail silently, systems fail, fail

Comments: none

Abstract:AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at this https URL

11. 【2603.15421】CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Link: https://arxiv.org/abs/2603.15421

Authors: Taeyun Roh, Wonjune Jang, Junha Jung, Jaewoo Kang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: complex reasoning tasks, Large language model, support knowledge reuse, Large language, agents heavily rely

Comments: none

Abstract:Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.

12. 【2603.15417】Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Link: https://arxiv.org/abs/2603.15417

Authors: Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: TTT methods, TTT, large language models, model directly learns, access to labels

Comments: none

Abstract:Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL), a recent TTT method that improves LLM reasoning by rewarding self-consistency using majority vote as a reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors, i.e., safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, there is a decline in reasoning ability, which we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed "HarmInject" prompts to force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.
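The majority-vote reward that the abstract attributes to TTRL can be sketched in a few lines. This is a simplified illustration under the assumption of a binary reward per sampled answer; the function name and tie-breaking rule are hypothetical, and the actual method may differ in detail.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return the majority answer and a 0/1 reward for each sampled answer.

    No ground-truth label is used: an answer is rewarded only for agreeing
    with the most common answer. Ties go to the answer seen first.
    """
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    rewards = [1.0 if answer == majority else 0.0 for answer in answers]
    return majority, rewards


# Four sampled answers to one unlabeled test prompt.
majority, rewards = majority_vote_rewards(["42", "42", "17", "42"])
print(majority, rewards)
```

Because the reward depends only on agreement, whatever the model already tends to produce for a prompt is what gets reinforced, which is consistent with the amplification behavior the abstract reports for injected prompts.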

13. 【2603.15409】SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Link: https://arxiv.org/abs/2603.15409

Authors: Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao

Categories: Computation and Language (cs.CL)

Keywords: Southeast Asian languages, Document Parsing, public services, document, Southeast Asian

Comments: Accepted by CVPR 2026

Abstract:Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.

14. 【2603.15408】TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

Link: https://arxiv.org/abs/2603.15408

Authors: Kai Wang, Biaojie Zeng, Zeming Wei, Chang Jin, Hefeng Zhou, Xiangtian Li, Chao Yang, Jingjing Qu, Xingcheng Xu, Xia Hu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: LLM-based multi-agent systems, MAS, concerns have emerged, rapid development, LLM Judge Factory

Comments: none

Abstract:With the rapid development of LLM-based multi-agent systems (MAS), significant safety and security concerns have emerged, introducing novel risks that go beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structure, an evaluation layer containing risk-specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.

15. 【2603.15405】Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Link: https://arxiv.org/abs/2603.15405

Authors: Zehao Chen, Rong Pan

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, demonstrated impressive capabilities, simulating diverse human, diverse human behaviors

Comments: none

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., "Extroverted" vs. "Introverted"), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model's output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
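The Dirichlet-based fusion step can be pictured with a small standard-library sketch. It assumes that each adapter is represented as a flat list of parameters and that fusion is a weighted sum of those parameters; the concentration values stand in for the policy network's output, and all names here are hypothetical rather than taken from the paper.

```python
import random


def sample_dirichlet(alphas, rng):
    """Sample mixing weights from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def fuse_adapters(adapters, weights):
    """Weighted sum of adapter parameter vectors (flat lists of floats).

    Fusing in parameter space is an assumption of this sketch.
    """
    size = len(adapters[0])
    return [sum(w * adapter[i] for w, adapter in zip(weights, adapters))
            for i in range(size)]


rng = random.Random(0)
# Stand-in for policy-network output: one concentration per frozen adapter.
concentrations = [2.0, 1.0, 0.5]
weights = sample_dirichlet(concentrations, rng)

# Three toy adapters saved at different points of SFT (values are made up).
adapters = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.6]]
print(weights, fuse_adapters(adapters, weights))
```

Since Dirichlet samples are non-negative and sum to one, the fused adapter is always a convex combination of the saved checkpoints, which is one plausible way to interpolate trait intensity along the training trajectory.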

16. 【2603.15402】A Closer Look into LLMs for Table Understanding

Link: https://arxiv.org/abs/2603.15402

Authors: Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, mechanisms remain unclear, internal mechanisms remain, success of Large

Comments: none

Abstract:Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on four dimensions: the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.

17. 【2603.15389】When Does Sparsity Mitigate the Curse of Depth in LLMs

Link: https://arxiv.org/abs/2603.15389

Authors: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu

Categories: Computation and Language (cs.CL)

Keywords: large language models, Recent work, language models, work has demonstrated, demonstrated the curse

Comments: 32 pages, 29 figures

Abstract:Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at this https URL.

18. 【2603.15364】CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving

Link: https://arxiv.org/abs/2603.15364

Authors: Erick Silva, Rehana Yasmin, Ali Shoker

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: complexity and diversity, identifying the root, increasingly complex, AVs grow, grow in complexity

Comments: none

Abstract:As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly complex. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures. Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.

19. 【2603.15340】DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Link: https://arxiv.org/abs/2603.15340

Authors: Xueyu Zhou, Yangrong Hu, Jian Huang

Subjects: Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: diffusion language models, Masked diffusion language, offering flexible generation, enabling efficient parallel, language models

Comments: 16 pages, 5 figures

Abstract:Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.

20. 【2603.15326】Tagarela - A Portuguese speech dataset from podcasts

Link: https://arxiv.org/abs/2603.15326

Authors: Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remains under-resourced due, Portuguese remains under-resourced, scarcity of public, significant advances, remains under-resourced

Comments: none

Abstract:Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at this https URL, to foster the development of robust speech technologies.

21. 【2603.15317】PYTHEN: A Flexible Framework for Legal Reasoning in Python

Link: https://arxiv.org/abs/2603.15317

Authors: Ha-Thanh Nguyen, Ken Satoh

Subjects: Computation and Language (cs.CL)

Keywords: Python-based framework, paper introduces PYTHEN, PYTHEN, legal reasoning, legal

Comments: Accepted at JURISIN 2026

Abstract:This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python's built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.
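
The idea of combining conjunctive (ALL) and disjunctive (ANY) conditions with exceptions in one rule can be illustrated with a minimal sketch. This is not PYTHEN's actual syntax or API, which the abstract does not show. The `Rule` class, the `fact` helper, and the toy contract rule are all hypothetical names chosen for illustration. Only Python's built-in `any()` and `all()` come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Facts = Dict[str, bool]
Condition = Callable[[Facts], bool]


@dataclass
class Rule:
    """Hypothetical defeasible rule: holds when its conditions hold and no exception fires."""
    conclusion: str
    all_of: List[Condition] = field(default_factory=list)   # conjunctive (ALL) conditions
    any_of: List[Condition] = field(default_factory=list)   # disjunctive (ANY) conditions
    exceptions: List["Rule"] = field(default_factory=list)  # rules that defeat this one

    def holds(self, facts: Facts) -> bool:
        conj = all(cond(facts) for cond in self.all_of)
        # An empty ANY list imposes no requirement, so treat it as satisfied.
        disj = any(cond(facts) for cond in self.any_of) if self.any_of else True
        defeated = any(exc.holds(facts) for exc in self.exceptions)
        return conj and disj and not defeated


def fact(name: str) -> Condition:
    """Build a condition that checks whether a named fact is true."""
    return lambda facts: facts.get(name, False)


# Toy rule (invented for this sketch): a contract is valid if there is an offer
# AND an acceptance, it is written OR witnessed, UNLESS a party is a minor.
minor = Rule("party_is_minor", all_of=[fact("minor")])
contract_valid = Rule(
    "contract_valid",
    all_of=[fact("offer"), fact("acceptance")],
    any_of=[fact("written"), fact("witnessed")],
    exceptions=[minor],
)

print(contract_valid.holds({"offer": True, "acceptance": True, "witnessed": True}))
print(contract_valid.holds({"offer": True, "acceptance": True, "written": True, "minor": True}))
```

Modelling exceptions as rules in their own right means an exception can carry its own conditions and exceptions, which is one simple way to express defeasibility.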

22. 【2603.15309】CCTU: A Benchmark for Tool Use under Complex Constraints

Link: https://arxiv.org/abs/2603.15309

Authors: Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, Xuanjing Huang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Solving problems, explicit constraints constitutes, large language models, requiring capabilities, function calling

Comments: none

Abstract:Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.

23. 【2603.15295】Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

Link: https://arxiv.org/abs/2603.15295

Authors: Giuseppe Samo, Paola Merlo

Subjects: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: capture cross-sentence paradigmatic, Large language models, sentence-based linguistic phenomena, shown remarkable performance, Large language

Comments: 9 pages, 16 figures, accepted at LREC 2026

Abstract:Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task -- an RPM/ARC-like task devised specifically for language -- is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.

24. 【2603.15270】From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

Link: https://arxiv.org/abs/2603.15270

Authors: Xu Zhang, Wenxin Ma, Chenxu Wu, Rongsheng Wang, Kun Zhang, S. Kevin Zhou

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: ICD coding, ICD, task in healthcare, coding, critical yet challenging

Comments: none

Abstract:ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.

25. 【2603.15259】Directional Embedding Smoothing for Robust Vision Language Models

Link: https://arxiv.org/abs/2603.15259

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: deploying trustworthy agentic, vision-language models, reliability of vision-language, crucial part, part of deploying

Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild

Abstract:The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing attack success rate over this diverse corpus of attacks, in particular, when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, as a lightweight, inference-time defense layer of an overall security framework.
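
To make "directional embedding noise" concrete, here is a rough sketch of noise that is aligned with each token embedding, followed by a vote over noisy copies. The abstract does not give RESTA's noise distribution or aggregation rule, so everything here is an assumption. The scaling form `(1 + sigma * z) * e`, the whole-output majority vote (a simplification of token-level aggregation), and the names `directional_noise`, `isotropic_noise`, `smoothed_predict`, and `toy_predict` are all illustrative.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Vector = List[float]


def directional_noise(embedding: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """Perturb an embedding only along its own direction: e' = (1 + sigma * z) * e, z ~ N(0, 1)."""
    scale = 1.0 + sigma * rng.gauss(0.0, 1.0)
    return [scale * x for x in embedding]


def isotropic_noise(embedding: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """For contrast: independent Gaussian noise on every coordinate."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in embedding]


def smoothed_predict(
    embeddings: Sequence[Sequence[float]],
    predict: Callable[[List[Vector]], str],
    sigma: float,
    num_samples: int,
    seed: int = 0,
) -> str:
    """Run the model on several noisy copies of the input and return the majority output."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(num_samples):
        noisy = [directional_noise(e, sigma, rng) for e in embeddings]
        votes[predict(noisy)] += 1
    return votes.most_common(1)[0][0]


def toy_predict(embeddings: List[Vector]) -> str:
    """Hypothetical stand-in for a VLM: 'refuse' when the first coordinates sum to a positive value."""
    return "refuse" if sum(e[0] for e in embeddings) > 0 else "comply"


tokens = [[0.8, -0.1], [0.3, 0.5]]
print(smoothed_predict(tokens, toy_predict, sigma=0.2, num_samples=11))
```

Directional noise changes only the length (and possibly the sign) of each embedding, so a noisy copy stays on the line spanned by the original vector, whereas isotropic noise can move it in any direction.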

26. 【2603.15245】Practicing with Language Models Cultivates Human Empathic Communication

Link: https://arxiv.org/abs/2603.15245

Authors: Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Bruce Lambert, Matthew Groh

Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: human connection, empathic, empathic communication, Empathy, normative empathic communication

Comments: none

Abstract:Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants' communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect that people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.

27. 【2603.15227】Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation

Link: https://arxiv.org/abs/2603.15227

Authors: Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé

Subjects: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: specific linguistic phenomena, Machine Translation, linguistic phenomena, specific linguistic, Machine

Comments: 11 pages, 1 figure, Language Resources and Evaluation Conference 2026

Abstract:Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text rather than the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. Datasets and annotation script will be shared upon request.

28. 【2603.15206】Efficient Document Parsing via Parallel Token Prediction

Link: https://arxiv.org/abs/2603.15206

Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial vision task, vision task, fundamental yet crucial, crucial vision, revolutionized by vision-language

Comments: Accepted by CVPR 2026 Findings

Abstract:Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.

29. 【2603.15187】The Hrunting of AI: Where and How to Improve English Dialectal Fairness

Link: https://arxiv.org/abs/2603.15187

Authors: Wei Li, Adrian de Wynter

Subjects: Computation and Language (cs.CL)

Keywords: large language models, English dialects, rarely-studied English dialects, language models, African-American Vernacular English

Comments: none

Abstract:It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. It is an issue because LLM-human agreement measures an LLM's alignment with the human consensus; and hence raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. But we also find encouraging signals, such as some LLMs' ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.

30. 【2603.15164】HindSight: Evaluating Research Idea Generation via Future Impact

Link: https://arxiv.org/abs/2603.15164

Authors: Bo Jiang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Evaluating AI-generated research, Evaluating AI-generated, ideas typically relies, human panels, typically relies

Comments: none

Abstract:Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p = 0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p < 0.001$). Moreover, HindSight scores are negatively correlated with LLM-judged novelty ($\rho = -0.29$, $p < 0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
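
A time-split evaluation of this kind can be sketched in a few lines. The abstract does not specify the matching method or the scoring formula, so the sketch swaps in simple stand-ins: word-overlap (Jaccard) similarity for matching, and `log(1 + citations)` plus a venue bonus for impact. The names `Paper`, `add_months`, `jaccard`, `hindsight_score`, the threshold, and the toy papers are all invented for illustration.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Paper:
    title: str
    published: date
    citations: int
    accepted: bool  # accepted at a peer-reviewed venue


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` months after d."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a crude stand-in for a real idea-to-paper matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def hindsight_score(idea: str, papers: List[Paper], cutoff: date,
                    window_months: int = 30, threshold: float = 0.3) -> float:
    """Score an idea by its best-matching paper published after the cutoff, within the window."""
    end = add_months(cutoff, window_months)
    best = 0.0
    for paper in papers:
        if not (cutoff < paper.published <= end):
            continue  # pre-cutoff papers were visible to the generator; later ones are out of scope
        sim = jaccard(idea, paper.title)
        if sim < threshold:
            continue
        impact = math.log1p(paper.citations) + (1.0 if paper.accepted else 0.0)
        best = max(best, sim * impact)
    return best


# Toy data, invented for this sketch.
cutoff = date(2023, 1, 1)
papers = [
    Paper("retrieval augmented idea generation", date(2022, 6, 1), 500, True),
    Paper("retrieval augmented idea generation for science", date(2024, 3, 1), 50, True),
    Paper("retrieval augmented idea generation at scale", date(2026, 1, 1), 80, True),
]
print(hindsight_score("retrieval augmented idea generation", papers, cutoff))
print(hindsight_score("quantum knitting patterns", papers, cutoff))
```

Only the second toy paper falls inside the 30-month window, so it alone determines the first score; an idea with no future counterpart scores zero.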

31. 【2603.15159】To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Link: https://arxiv.org/abs/2603.15159

Authors: Yitong Zhang, Chengze Li, Ruize Chen, Guowei Yang, Xiaoran Jia, Yijie Ren, Jia Li

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shown strong potential, Large Language Models, Large Language, invoke private-library APIs, shown strong

Comments: 12 pages

Abstract:Large Language Models (LLMs) have shown strong potential for code generation, yet they remain limited in private-library-oriented code generation, where the goal is to generate code using APIs from private libraries. Existing approaches mainly rely on retrieving private-library API documentation and injecting relevant knowledge into the context at inference time. However, our study shows that this is insufficient: even given accurate required knowledge, LLMs still struggle to invoke private-library APIs effectively. To address this limitation, we propose PriCoder, an approach that teaches LLMs to invoke private-library APIs through automatically synthesized data. Specifically, PriCoder models private-library data synthesis as the construction of a graph, and alternates between two graph operators: (1) Progressive Graph Evolution, which improves data diversity by progressively synthesizing more diverse training samples from basic ones, and (2) Multidimensional Graph Pruning, which improves data quality through a rigorous filtering pipeline. To support rigorous evaluation, we construct two new benchmarks based on recently released libraries that are unfamiliar to the tested models. Experiments on three mainstream LLMs show that PriCoder substantially improves private-library-oriented code generation, yielding gains of over 20% in pass@1 in many settings, while causing negligible impact on general code generation capability. Our code and benchmarks are publicly available at this https URL.

32. 【2603.15130】Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

Link: https://arxiv.org/abs/2603.15130

Authors: Miriam Winkler, Verena Blaschke, Barbara Plank

Subjects: Computation and Language (cs.CL)

Keywords: underexplored in NLP, NLP research, Indirect Question Answering, daily communication, common feature

Comments: To appear at LREC 2026

Abstract:Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset, that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that the IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.

33. 【2603.15117】MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge

Link: https://arxiv.org/abs/2603.15117

Authors: Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng, Muhao Xu, Hongmei Yan, Weiye Song, Yi Wan

Subjects: Computation and Language (cs.CL)

Keywords: real-world knowledge continues, parametric knowledge acquired, multimodal knowledge updating, real-world knowledge, knowledge

Comments: none

Abstract:As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to keep consistent with it. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.

34. 【2603.15094】Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies

Link: https://arxiv.org/abs/2603.15094

Authors: Makoto Nakamura

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: XML schema, Japanese Legal Standard, consecutive research projects, research projects based, computational comparative law

Comments: 21 pages, 5 figures

Abstract:This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.

35. 【2603.15061】Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Link: https://arxiv.org/abs/2603.15061

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin

Subjects: Computation and Language (cs.CL)

Keywords: verifiable reference answers, typical open-ended generation, lacks verifiable reference, coarse feedback signals, long constrained reward

Comments: none

Abstract:As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.

36. 【2603.15051】Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Link: https://arxiv.org/abs/2603.15051

Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: mathematical word problems, elicit multi-step reasoning, word problems, elicit multi-step, Token-level

Comments: Accepted at ICLR 2026, LIT Workshop

Abstract:Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
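
The adaptive halting idea, stopping refinement once the anchors stop moving while respecting a shared step budget, can be sketched as a simple loop. The abstract does not define the stability measure or the update rule, so both are assumptions here: stability is the largest Euclidean shift of any anchor, and `make_toy_step` is a made-up contraction standing in for a model forward pass. The names `anchor_shift` and `refine_with_halting` are hypothetical too.

```python
import math
from typing import Callable, List, Tuple

Anchors = List[List[float]]


def anchor_shift(old: Anchors, new: Anchors) -> float:
    """Largest Euclidean distance moved by any anchor between two iterations."""
    return max(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in zip(old, new)
    )


def refine_with_halting(
    anchors: Anchors,
    refine_step: Callable[[Anchors], Anchors],
    max_steps: int = 16,
    tol: float = 1e-3,
) -> Tuple[Anchors, int]:
    """Refine anchors until they stabilise or the step budget runs out; return anchors and steps used."""
    steps = 0
    for steps in range(1, max_steps + 1):
        updated = refine_step(anchors)
        shift = anchor_shift(anchors, updated)
        anchors = updated
        if shift < tol:
            break  # anchor dynamics have converged, so stop early
    return anchors, steps


def make_toy_step(target: List[float], rate: float) -> Callable[[Anchors], Anchors]:
    """Hypothetical refinement: move each anchor a fraction `rate` of the way to a target."""
    def step(anchors: Anchors) -> Anchors:
        return [[a + rate * (t - a) for a, t in zip(vec, target)] for vec in anchors]
    return step


start = [[0.0, 0.0]]
_, easy_steps = refine_with_halting(start, make_toy_step([1.0, 1.0], rate=0.9))
_, hard_steps = refine_with_halting(start, make_toy_step([1.0, 1.0], rate=0.3))
print(easy_steps, hard_steps)
```

In this toy setup the fast-converging instance halts well before the budget, while the slow one uses all of it, which mirrors the allocation of fewer steps to easier inputs.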

37. 【2603.15034】Interpretable Predictability-Based AI Text Detection: A Replication Study

Link: https://arxiv.org/abs/2603.15034

Authors: Adam Skurla, Dominik Macko, Jakub Simko

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: machine-generated texts, paper replicates, replicates and extends, authorship attribution, attribution of machine-generated

Comments: none

Abstract:This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves results that are comparable to or better than language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.

38. 【2603.15031】Attention Residuals

Link: https://arxiv.org/abs/2603.15031

Authors: Kimi Team: Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou

Categories: Computation and Language (cs.CL)

Keywords: fixed unit weights, modern LLMs, preceding layer outputs, layer outputs, Residual connections

Comments: attnres tech report

Abstract:Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
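To make the contrast between fixed unit-weight accumulation and attention over preceding layer outputs concrete, here is a minimal sketch. It is written under assumptions: the abstract does not say how queries and keys are formed, so the `query` vector and the scaled dot-product scoring below are illustrative choices, not the AttnRes parameterisation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(query, layer_outputs):
    """Mix earlier layer outputs with softmax weights instead of a plain sum."""
    scale = math.sqrt(len(query))
    # Assumed scoring rule: scaled dot product between the query and each output.
    scores = [sum(q * h for q, h in zip(query, out)) / scale for out in layer_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[i] for w, out in zip(weights, layer_outputs))
             for i in range(len(query))]
    return mixed, weights

outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy outputs of three earlier layers
plain_sum = [sum(col) for col in zip(*outputs)]  # standard residual: unit weights
mixed, weights = attn_residual([2.0, 0.0], outputs)
print(plain_sum, mixed, weights)
```

Because the softmax weights sum to one, the mixed state is a convex combination of earlier outputs and does not grow with the number of layers, whereas the plain sum does.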

39. 【2603.15020】MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal

Link: https://arxiv.org/abs/2603.15020

Authors: Yiqi Nie, Fei Wang, Junjie Chen, Kun Li, Yudi Cai, Dan Guo, Chenglong Li, Meng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: jointly convey nuanced, overlaid text jointly, text jointly convey, convey nuanced affect, tightly coupled

Abstract:Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: this https URL.

40. 【2603.15005】Pretraining and Benchmarking Modern Encoders for Latvian

Link: https://arxiv.org/abs/2603.15005

Authors: Arturs Znotins

Categories: Computation and Language (cs.CL)

Keywords: Encoder-only transformers remain, transformers remain essential, Encoder-only transformers, practical NLP tasks, Latvian remain underrepresented

Abstract:Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained models and evaluation resources to support further research and practical applications in Latvian NLP.

41. 【2603.14997】OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

Link: https://arxiv.org/abs/2603.14997

Authors: Jeffrey Flynt

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Evaluating retrieval-augmented generation, rarely provide cleanly, pipelines requires corpora, real-world datasets rarely, datasets rarely provide

Abstract:Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact, properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across documents. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
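The abstract mentions a hybrid reciprocal-rank-fusion detector without giving details. As a reminder of what reciprocal rank fusion does in general, the sketch below fuses two ranked lists with the standard score 1 / (k + rank). The incident identifiers, the two retrieval sources, and the constant k = 60 are illustrative assumptions and are not taken from OrgForge.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each item scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical candidate incidents returned by two different retrievers.
keyword_hits = ["INC-7", "INC-2", "INC-9"]
embedding_hits = ["INC-7", "INC-4", "INC-2"]
fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
print(fused)
```

Items that rank well in both lists rise to the top, which is why the technique is a common choice for combining lexical and embedding-based retrieval.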

42. 【2603.14987】Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Link: https://arxiv.org/abs/2603.14987

Authors: Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: increased authority poses, authority poses greater, multi-step real-world workflows, static question answering, poses greater risks

Comments: 6 pages, 1 figure. Submitted to KDD 2026 Blue Sky Track

Abstract:As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent's trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity -- particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at this https URL.

43. 【2603.14975】Why Agents Compromise Safety Under Pressure

Link: https://arxiv.org/abs/2603.14975

Authors: Hengle Jiang, Ke Tang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

Keywords: Large Language Model, Large Language, complex environments frequently, environments frequently encounter, maximizing goal achievement

Comments: 17 pages, 5 figures

Abstract:Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that under this pressure, agents exhibit normative drift, in which they strategically sacrifice safety to preserve utility. Notably, we find that advanced reasoning capabilities accelerate this decline, as models construct linguistic rationalizations to justify violations. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.

44. 【2603.14968】Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Link: https://arxiv.org/abs/2603.14968

Authors: Zhuoshang Wang, Yubing Ren, Yanan Cao, Fang Fang, Xiaoxue Li, Li Guo

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: provider-side scheme-specific detectors, LLM provenance, mechanism for LLM, existing secret-key schemes, secret-key schemes tightly

Abstract:While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.

45. 【2603.14937】LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs

Link: https://arxiv.org/abs/2603.14937

Authors: Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: abundant textual information, integrate complex structural, complex structural dependencies, existing learning paradigms, integrate complex

Comments: 20 pages, 5 figures. Work in progress

Abstract:Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node's raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.

46. 【2603.14911】Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs

Link: https://arxiv.org/abs/2603.14911

Authors: Nikita Mosievskiy

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: mapping Common Vulnerabilities, Common Weakness Enumeration, Vulnerabilities and Exposures, Common Vulnerabilities, fine-tuned RoBERTa-base classifier

Comments: 9 pages, 2 figures, 6 tables. Dataset: [this https URL](https://huggingface.co/datasets/xamxte/cve-to-cwe) Model: [this https URL](https://huggingface.co/xamxte/cwe-classifier-roberta-base)

Abstract:We present a fine-tuned RoBERTa-base classifier (125M parameters) for mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories. We construct a large-scale training dataset of 234,770 CVE descriptions with AI-refined CWE labels using Claude Sonnet 4.6, and agreement-filtered evaluation sets where NVD and AI labels agree. On our held-out test set (27,780 samples, 205 CWE classes), the model achieves 87.4% top-1 accuracy and 60.7% Macro F1 -- a +15.5 percentage-point Macro F1 gain over a TF-IDF baseline that already reaches 84.9% top-1, demonstrating the model's advantage on rare weakness categories. On the external CTI-Bench benchmark (NeurIPS 2024), the model achieves 75.6% strict accuracy (95% CI: 72.8-78.2%) -- statistically indistinguishable from Cisco Foundation-Sec-8B-Reasoning (75.3%, 8B parameters) at 64x fewer parameters. We release the dataset, model, and training code.
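The abstract reports a modest top-1 gap between the two systems but a large Macro F1 gap. A toy example shows how that can happen: top-1 accuracy is dominated by frequent classes, while Macro F1 averages per-class F1 so that rare classes count equally. The labels and predictions below are invented for illustration and are not data from the paper; the last line only checks the parameter ratio quoted in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels: nine samples of a frequent class, one of a rare class.
y_true = ["frequent"] * 9 + ["rare"]
y_pred = ["frequent"] * 10  # a classifier that never predicts the rare class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
param_ratio = 8e9 / 125e6  # 8B parameters versus 125M parameters
print(accuracy, round(macro, 3), param_ratio)
```

Here accuracy stays at 0.9 while Macro F1 falls below 0.5, because the missed rare class contributes an F1 of zero to the average.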

47. 【2603.14903】ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Link: https://arxiv.org/abs/2603.14903

Authors: Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jingsong Su

Categories: Computation and Language (cs.CL)

Keywords: recently demonstrated promising, demonstrated promising performance, Large language models, Large language, recently demonstrated

Abstract:Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.

48. 【2603.14893】LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Link: https://arxiv.org/abs/2603.14893

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Expected Calibration Error, Large language models, Large language, Error that conflate, Signal Detection Theory

Comments: 15 pages, 8 figures, 2 tables

Abstract:Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model's ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
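For readers unfamiliar with Signal Detection Theory, the sketch below computes the two textbook quantities that the abstract separates: sensitivity (d') and criterion (c), from a hit rate and a false-alarm rate. It uses the simpler equal-variance Gaussian model, not the unequal-variance fitting the paper applies, and the rates are made-up numbers.

```python
from statistics import NormalDist

def sdt_measures(hit_rate, false_alarm_rate):
    """Equal-variance SDT: d' = z(H) - z(F), c = -(z(H) + z(F)) / 2."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa
    criterion = -0.5 * (z_hit + z_fa)
    return d_prime, criterion

# Hypothetical observers: the first is unbiased, the second is liberal,
# meaning it endorses answers as correct more readily.
d_unbiased, c_unbiased = sdt_measures(0.80, 0.20)
d_liberal, c_liberal = sdt_measures(0.90, 0.40)
print(round(d_unbiased, 3), round(c_unbiased, 3))
print(round(d_liberal, 3), round(c_liberal, 3))
```

The two observers have similar sensitivity but different criteria, which is the kind of distinction a single calibration score cannot express.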

49. 【2603.14891】Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

Link: https://arxiv.org/abs/2603.14891

Authors: Han Zhang, Jiamin Su, Li liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: predicts multiple rubric-defined, discrete rating scale, ordered discrete rating, Automated essay scoring, multiple rubric-defined trait

Abstract:Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
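A rough sketch of what reading a score off the language-model head could look like: keep only the logits of the predefined score tokens, normalise them, and decide in score space. The token ids, the toy logits, and the form of the distance-aware penalty (expected absolute distance to the gold score) are assumptions for illustration; the abstract does not specify the regulariser used by DLOM-DA.

```python
import math

def score_distribution(vocab_logits, score_token_ids):
    """Softmax restricted to the logits of the predefined score tokens."""
    picked = [vocab_logits[t] for t in score_token_ids]
    m = max(picked)
    exps = [math.exp(x - m) for x in picked]
    total = sum(exps)
    return [e / total for e in exps]

def distance_penalty(probs, scores, gold):
    """Assumed ordinal penalty: expected absolute distance to the gold score."""
    return sum(p * abs(s - gold) for p, s in zip(probs, scores))

vocab_logits = [0.1, 2.0, 0.3, 1.0, -1.0, 0.5]  # toy output of the LM head
score_token_ids = [1, 3, 5]                      # hypothetical ids of tokens "1", "2", "3"
scores = [1, 2, 3]

probs = score_distribution(vocab_logits, score_token_ids)
predicted = scores[probs.index(max(probs))]      # explicit decision in score space
near = distance_penalty([0.0, 1.0, 0.0], scores, gold=1)
far = distance_penalty([0.0, 0.0, 1.0], scores, gold=1)
print(predicted, near, far)
```

Under this penalty a prediction two points away from the gold score costs twice as much as one a single point away, which plain cross-entropy over score tokens would not distinguish.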

50. 【2603.14884】Customizing ChatGPT for Second Language Speaking Practice: Genuine Support or Just a Marketing Gimmick?

Link: https://arxiv.org/abs/2603.14884

Authors: Fanfei Meng

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: ChatGPT Voice Mode, Voice Mode, uncustomized Advanced mode, Advanced mode, customized Advanced mode

Comments: Short paper accepted at the International Conference of the Learning Sciences (ICLS) 2025, International Society of the Learning Sciences

Abstract:ChatGPT, with its customization features and Voice Mode, has the potential for more engaging and personalized ESL (English as a Second Language) education. This study examines the efficacy of customized ChatGPT conversational features in facilitating ESL speaking practices, comparing the performance of four versions of ChatGPT Voice Mode: uncustomized Standard mode, uncustomized Advanced mode, customized Standard mode, and customized Advanced mode. Customization was guided by prompt engineering principles and grounded in relevant theories, including Motivation Theory, Culturally Responsive Teaching (CRT), Communicative Language Teaching (CLT), and the Affective Filter Hypothesis. Content analysis found that customized versions generally provided more balanced feedback and emotional support, contributing to a positive and motivating learning environment. However, cultural responsiveness did not show significant improvement despite targeted customization efforts. These initial findings suggest that customization could enhance ChatGPT's capacity as a more effective language tutor, with the standard model already capable of meeting the learning needs. The study underscores the importance of prompt engineering and AI literacy in maximizing AI's potential in language learning.

51. 【2603.14873】Developing an English-Efik Corpus and Machine Translation System for Digitization Inclusion

Link: https://arxiv.org/abs/2603.14873

Authors: Offiong Bassey Edet, Mbuotidem Sunday Awak, Emmanuel Oyo-Ita, Benjamin Okon Nyong, Ita Etim Bassey

Categories: Computation and Language (cs.CL)

Keywords: Low-resource languages serve, human history, preserving cultural, intellectual diversity, serve as invaluable

Comments: 8 pages, 1 figure, accepted at AfricaNLP 2026 (co-located with EACL)

Abstract:Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.

52. 【2603.14864】Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

Link: https://arxiv.org/abs/2603.14864

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Categories: Computation and Language (cs.CL)

Keywords: agents show promise, LLM agents show, accurately capturing user, bundle deals, conversations is critical

Comments: Submitted to ACL 2026

Abstract:In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.

53. 【2603.14843】ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Link: https://arxiv.org/abs/2603.14843

Authors: Hankun Kang, Xin Miao, Jianhao Chen, Jintao Wen, Mayi Xu, Weiyu Zhang, Wenpeng Lu, Tieyun Qian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: online social actions, online social environment, healthy online social, online social, social actions

Abstract:Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection...

54. 【2603.14838】The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

Link: https://arxiv.org/abs/2603.14838

Authors: Elmira Salari (1), Maria Claudia Nunes Delfino (2), Hazem Amamou (3), José Victor de Souza (3), Shruti Kshirsagar (1), Alan Davoust (4), Anderson Avila (3) ((1) Wichita State University, (2) Pontifícia Universidade Católica de São Paulo, (3) Institut national de la recherche scientifique, (4) Université du Québec en Outaouais)

Categories: Computation and Language (cs.CL)

Keywords: large language models, paper studies, studies the impact, large language, ideological

Abstract:This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked to answer questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs' responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs' responses based on retrieved ideological texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs' outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
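The alignment measure named in the abstract is cosine similarity. A minimal lexical version can be sketched with bag-of-words counts, as below; the whitespace tokenisation and the placeholder sentences are assumptions, and the paper's actual lexical and semantic representations are not described in the abstract.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (an assumed tokenisation)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Placeholder texts standing in for a reference passage and two model responses.
reference = bag_of_words("the treatment was endorsed by clinical trials")
close_response = bag_of_words("clinical trials endorsed the treatment")
unrelated_response = bag_of_words("weather forecasts predict rain tomorrow")

sim_close = cosine(reference, close_response)
sim_unrelated = cosine(reference, unrelated_response)
print(round(sim_close, 3), sim_unrelated)
```

A semantic variant would apply the same `cosine` function to dense sentence embeddings instead of word counts.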

55. 【2603.14803】VorTEX: Various overlap ratio for Target speech EXtraction

Link: https://arxiv.org/abs/2603.14803

Authors: Ro-hoon Oh, Jihwan Seol, Bugeun Kim

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: target speaker voice, Target speech extraction, Decoupled Adaptive Multi-branch, aims to recover, Target speech

Comments: arXiv Preprint

Abstract:Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.

56. 【2603.14799】Universe Routing: Why Self-Evolving Agents Need Epistemic Control

链接https://arxiv.org/abs/2603.14799

作者:Zhaohui Geoffrey Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:critical failure mode, lack of knowledge, mode of current, inability to decide, current lifelong agents

备注: 10 pages. Accepted at the LLA Workshop at ICLR 2026 (camera-ready version)

点击查看摘要

Abstract:A critical failure mode of current lifelong agents is not lack of knowledge, but the inability to decide how to reason. When an agent encounters "Is this coin fair?" it must recognize whether to invoke frequentist hypothesis testing or Bayesian posterior inference - frameworks that are epistemologically incompatible. Mixing them produces not minor errors, but structural failures that propagate across decision chains. We formalize this as the universe routing problem: classifying questions into mutually exclusive belief spaces before invoking specialized solvers. Our key findings challenge conventional assumptions: (1) hard routing to heterogeneous solvers matches soft MoE accuracy while being 7x faster because epistemically incompatible frameworks cannot be meaningfully averaged; (2) a 465M-parameter router achieves a 2.3x smaller generalization gap than keyword-matching baselines, indicating semantic rather than surface-level reasoning; (3) when expanding to new belief spaces, rehearsal-based continual learning achieves zero forgetting, outperforming EWC by 75 percentage points, suggesting that modular epistemic architectures are fundamentally more amenable to lifelong learning than regularization-based approaches. These results point toward a broader architectural principle: reliable self-evolving agents may require an explicit epistemic control layer that governs reasoning framework selection.

57. 【2603.14782】Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

链接https://arxiv.org/abs/2603.14782

作者:Renhao Pei,Siyao Peng,Verena Blaschke,Robert Litschko,Barbara Plank

类目:Computation and Language (cs.CL)

关键词:Large Language Models, reliability vary widely, Language Models, vary widely, humans to seek

备注: 23 pages, accepted at LREC 2026 as an oral presentation

点击查看摘要

Abstract:Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page, which is absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical, geographic annotations, and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.

58. 【2603.14779】Vietnamese Automatic Speech Recognition: A Revisit

链接https://arxiv.org/abs/2603.14779

作者:Thi Vu,Linh The Nguyen,Dat Quoc Nguyen

类目:Computation and Language (cs.CL)

关键词:Automatic Speech Recognition, Automatic Speech, Speech Recognition, performance is heavily, availability of large-scale

备注: Accepted to EACL 2026 Findings

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at this https URL.

59. 【2603.14756】Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark

链接https://arxiv.org/abs/2603.14756

作者:Wei Shao,Lemao Liu,Yinqiao Li,Guoping Huang,Shuming Shi,Linqi Song

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Current online translation, online translation services, require sending user, services require sending, Current online

备注: 15 pages, 5 figures, Accepted by IEEE Journal of Selected Topics in Signal Processing

点击查看摘要

Abstract:Current online translation services require sending user text to cloud servers, posing a risk of privacy leakage when the text contains sensitive information. This risk hinders the application of online translation services in privacy-sensitive scenarios. One way to mitigate this risk for online translation services is introducing privacy protection mechanisms targeting the inference stage of translation models. However, compared to subfields of NLP like text classification and summarization, the machine translation research community has limited exploration of privacy protection during the inference stage. There is no clearly defined privacy protection task for the inference stage, dedicated evaluation datasets and metrics, and reference benchmark methods. The absence of these elements has seriously constrained researchers' in-depth exploration of this direction. To bridge this gap, this paper proposes a novel "Privacy-Preserving Machine Translation" (PPMT) task, aiming to protect the private information in text during the model inference stage. For this task, we constructed three benchmark test datasets, designed corresponding evaluation metrics, and proposed a series of benchmark methods as a starting point for this task. The definition of privacy is complex and diverse. Considering that named entities often contain a large amount of personal privacy and commercial secrets, we have focused our research on protecting only the named entity's privacy in the text. We expect this research work will provide a new perspective and a solid foundation for the privacy protection problem in machine translation.

60. 【2603.14755】Learning Constituent Headedness

链接https://arxiv.org/abs/2603.14755

作者:Zeyao Qi,Yige Chen,KyungTae Lim,Haihua Pan,Jungyeul Park

类目:Computation and Language (cs.CL)

关键词:treebanks rarely encode, processing pipelines recover, constituency treebanks rarely, syntactic analysis, organizing device

备注

点击查看摘要

Abstract:Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.

61. 【2603.14723】Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

链接https://arxiv.org/abs/2603.14723

作者:Xinran Zhang

类目:Computation and Language (cs.CL)

关键词:explicit identity content, written may matter, core safety rules, B-matched creed condition, identity content

备注

点击查看摘要

Abstract:How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > \text{baseline}$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.

62. 【2603.14712】Towards Next-Generation LLM Training: From the Data-Centric Perspective

链接https://arxiv.org/abs/2603.14712

作者:Hao Liang,Zhengyang Zhao,Zhaoyang Han,Meiyi Qiang,Xiaochen Ma,Bohan Zeng,Qifeng Cai,Zhiyu Li,Linpeng Tang,Weinan E,Wentao Zhang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, demonstrated remarkable performance, Large language, language models, tasks and domains

备注

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.

63. 【2603.14707】Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

链接https://arxiv.org/abs/2603.14707

作者:Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:graphical user interfaces, Computer-using agents, act directly, user interfaces, directly on graphical

备注

点击查看摘要

Abstract:Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent's reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Model, benchmark, and code are provided at this https URL.

64. 【2603.14674】Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing

链接https://arxiv.org/abs/2603.14674

作者:Nudrat Habib,Elisa Barney Smith,Steven Olsen Smith

类目:Computation and Language (cs.CL)

关键词:Herman Melville reading, Herman Melville, Melville reading, study investigates, investigates the potential

备注

点击查看摘要

Abstract:This study investigates the potential influence of Herman Melville reading on his own writings through computational semantic similarity analysis. Using documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The methodology involves segmenting texts at both sentence level and non-overlapping 5-gram level, followed by similarity computation using BERTScore. Rather than applying fixed thresholds to determine reuse, we interpret precision, recall, and F1 scores as indicators of possible semantic alignment that may suggest literary influence. Experimental results demonstrate that the approach successfully captures expert-identified instances of similarity and highlights additional passages warranting further qualitative examination. The findings suggest that semantic similarity methods provide a useful computational framework for supporting source and influence studies in literary scholarship.
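
As a rough illustration of the "non-overlapping 5-gram" segmentation mentioned in the abstract, the sketch below splits a passage into consecutive five-token chunks. Whitespace tokenization, keeping the short final chunk, and the function name `non_overlapping_ngrams` are assumptions made for this example; the paper's actual preprocessing may differ, and the BERTScore comparison step is not shown.

```python
def non_overlapping_ngrams(text: str, n: int = 5) -> list[str]:
    """Split whitespace tokens into consecutive, non-overlapping n-token chunks.

    Assumption: tokens are whitespace-separated and a shorter trailing
    chunk is kept rather than dropped.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]


# Example passage (chosen only for illustration).
passage = "Call me Ishmael. Some years ago never mind how long precisely"
segments = non_overlapping_ngrams(passage)
for segment in segments:
    print(segment)
```

Each chunk from a Melville passage could then be scored against chunks from a library text with a semantic similarity measure such as BERTScore.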

65. 【2603.14672】Seamless Deception: Larger Language Models Are Better Knowledge Concealers

链接https://arxiv.org/abs/2603.14672

作者:Dhananjay Ashok,Ruth-Ann Armstrong,Jonathan May

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:acquire harmful knowledge, Language Models, acquire harmful, feign ignorance, Language

备注

点击查看摘要

Abstract:Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.

66. 【2603.14664】Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

链接https://arxiv.org/abs/2603.14664

作者:Mark Baciak,Thomas A. Cellucci,Deanna M. Falkowski

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:artificial intelligence development, intelligence development assumes, dominant narrative, narrative of artificial, artificial intelligence

备注

点击查看摘要

Abstract:The dominant narrative of artificial intelligence development assumes that progress is continuous and that capability scales monotonically with model size. We challenge both assumptions. Drawing on punctuated equilibrium theory from evolutionary biology, we show that AI development proceeds not through smooth advancement but through extended periods of stasis interrupted by rapid phase transitions that reorganize the competitive landscape. We identify five such eras since 1943 and four epochs within the current Generative AI Era, each initiated by a discontinuous event -- from the transformer architecture to the DeepSeek Moment -- that rendered the prior paradigm subordinate. To formalize the selection pressures driving these transitions, we develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance. The central result is the Institutional Scaling Law, which proves that institutional fitness is non-monotonic in model scale. Beyond an environment-specific optimum, scaling further degrades fitness as trust erosion and cost penalties outweigh marginal capability gains. This directly contradicts classical scaling laws and carries a strong implication: orchestrated systems of smaller, domain-adapted models can mathematically outperform frontier generalists in most institutional deployment environments. We derive formal conditions under which this inversion holds and present supporting empirical evidence spanning frontier laboratory dynamics, post-training alignment evolution, and the rise of sovereign AI as a geopolitical selection pressure.

67. 【2603.14643】Argumentation for Explainable and Globally Contestable Decision Support with LLMs

链接https://arxiv.org/abs/2603.14643

作者:Adam Dejl,Matthew Williams,Francesca Toni

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, exhibit strong general, strong general capabilities, language models

备注

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.

68. 【2603.14636】Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

链接https://arxiv.org/abs/2603.14636

作者:Lok-Lam Ieong,Chia-Chien Chen,Chih-Kai Yang,Yu-Han Huang,An-Yu Cheng,Hung-yi Lee

类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:training remains challenging, large audio-language models, remains challenging, extended to large, large audio-language

备注: 6 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.

69. 【2603.14631】Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System

链接https://arxiv.org/abs/2603.14631

作者:Sai P. Selvaraj,Khadija Mahmoud,Anuj Iravane

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Increasing staffing constraints, Increasing staffing, increasing automation, support PA review, staffing constraints

备注

点击查看摘要

Abstract:Increasing staffing constraints and turnaround-time pressures in Prior authorization (PA) have led to increasing automation of decision systems to support PA review. Evaluating fairness in such systems poses unique challenges because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups, making parity in approval rates an inappropriate fairness metric. We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes. Using 7,166 human-reviewed cases spanning 27 medical necessity guidelines, we assessed consistency in sex, age, race/ethnicity, and socioeconomic status. Our evaluation combined error-rate comparisons, tolerance-band analysis with a predefined $\pm$5 percentage-point margin, statistical power evaluation, and protocol-controlled logistic regression. Across most demographics, model error rates were consistent, and confidence intervals fell within the predefined tolerance band, indicating no meaningful performance differences. For race/ethnicity, point estimates remain small, but subgroup sample sizes were limited, resulting in wide confidence intervals and underpowered tests, with inconclusive evidence within the dataset we explored. These findings illustrate a rigorous and regulator-aligned approach to fairness evaluation in administrative healthcare AI systems.
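
The tolerance-band idea in the abstract can be sketched as follows: compute a confidence interval for the difference in error rates between two groups and check whether the whole interval lies inside a ±5 percentage-point margin. The Wald (normal-approximation) interval, the function names, and all counts below are assumptions for illustration only; the abstract does not say which interval the authors used, and the numbers are not from the paper.

```python
import math


def error_rate_gap_ci(err_a: int, n_a: int, err_b: int, n_b: int, z: float = 1.96):
    """Return (difference, lower, upper) for the gap in error rates.

    Uses a Wald normal-approximation interval (an assumption of this sketch).
    """
    p_a, p_b = err_a / n_a, err_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


def within_tolerance(lower: float, upper: float, margin: float = 0.05) -> bool:
    """True only if the entire interval sits inside [-margin, +margin]."""
    return -margin <= lower and upper <= margin


# Hypothetical counts: two large groups with similar error rates.
_, lo_big, hi_big = error_rate_gap_ci(60, 2000, 70, 2000)
large_groups_ok = within_tolerance(lo_big, hi_big)

# Hypothetical counts: one small subgroup, so the interval is wide.
_, lo_small, hi_small = error_rate_gap_ci(3, 40, 60, 2000)
small_group_ok = within_tolerance(lo_small, hi_small)

print(large_groups_ok, small_group_ok)
```

With a small subgroup the interval spills outside the band even when the point estimate is inside it, which matches the abstract's description of inconclusive, underpowered results for some demographics.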

70. 【2603.14602】$PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

链接https://arxiv.org/abs/2603.14602

作者:Shubhashis Roy Dipta,Daniel Bis,Kun Zhou,Lichao Wang,Benjamin Z. Yao,Chenlei Guo,Ruhi Sarikaya

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Conversational assistants powered, Conversational assistants, large language models, excel at tool-use, adhering to complex

备注

点击查看摘要

Abstract:Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
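
A minimal sketch of a Jaccard-based recall reward with a hallucination penalty, in the spirit of the PolicyRecall reward named in the abstract. The exact formula is not given there, so the way the two terms are combined, the `penalty` weight, the function name, and the policy identifiers are all hypothetical.

```python
def policy_recall_reward(recalled, relevant, known_policies, penalty=0.5):
    """Jaccard overlap between recalled and relevant policies, minus a
    penalty for each recalled policy that does not exist at all.

    Assumption: a linear combination; the paper may define this differently.
    """
    recalled, relevant = set(recalled), set(relevant)
    union = recalled | relevant
    jaccard = len(recalled & relevant) / len(union) if union else 1.0
    hallucinated = recalled - set(known_policies)
    return jaccard - penalty * len(hallucinated)


# Hypothetical policy catalogue and gold set for one query.
known = {"refund_30d", "id_check", "no_cash", "escalate"}
relevant = {"refund_30d", "id_check"}

# Over-recall of a real but irrelevant policy lowers the Jaccard term only.
r_over = policy_recall_reward({"refund_30d", "id_check", "no_cash"}, relevant, known)

# Recalling a policy that does not exist is additionally penalised.
r_halluc = policy_recall_reward({"refund_30d", "free_upgrade"}, relevant, known)

print(r_over, r_halluc)
```

A scalar of this kind could serve as one reward component during GRPO training, alongside task-success signals.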

71. 【2603.14593】Parameter-Efficient Quality Estimation via Frozen Recursive Models

链接https://arxiv.org/abs/2603.14593

作者:Umar Abubacar,Roman Bauer,Diptesh Kanojia

类目:Computation and Language (cs.CL)

关键词:Tiny Recursive Models, Tiny Recursive, Recursive Models, shared network, achieve strong results

备注: Accepted to LowResLM Workshop @ EACL 2026

点击查看摘要

Abstract:Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on $8$ language pairs on a low-resource QE dataset reveal three findings. First, TRM's recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Next, representation quality dominates architectural choices, and lastly, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37$\times$ (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman's correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80$\times$ fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at this https URL.

72. 【2603.14575】CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad

链接https://arxiv.org/abs/2603.14575

作者:Yongqiang Chen,Chenxi Liu,Zhenhao Chen,Tongliang Liu,Bo Han,Kun Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:Large Language Models, Language Models, Large Language, build AI Scientists, notable successes

备注: Preprint of ongoing work; Yongqiang and Chenxi contributed equally.

点击查看摘要

Abstract:Evolve-based agent such as AlphaEvolve is one of the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the beginning, CausalEvolve first identifies outcome-level factors that offer complementary inspirations in improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns during the evolution and abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.

73. 【2603.14567】Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes

链接https://arxiv.org/abs/2603.14567

作者:Deepon Halder,Raj Dabre

类目:Computation and Language (cs.CL)

关键词:Probabilistic language generators, discrete stochastic processes, impose static truncation, static truncation rules, Probabilistic language

备注

点击查看摘要

Abstract:Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model's distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.
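
To make the idea of an entropy-coupled relative probability band concrete, here is a small sketch. It keeps tokens whose probability is within a band below the top token's probability, and widens that band as the normalized Shannon entropy grows. The abstract does not state the actual coupling, so the linear rule `cutoff = p_max * (1 - H / H_max)` and the function name are stand-in assumptions, not the Top-b algorithm itself.

```python
import math


def entropy_band_filter(probs: list[float]) -> dict[int, float]:
    """Keep tokens with p >= p_max * (1 - b), where b is the normalized entropy.

    Assumption: a linear entropy-to-bandwidth mapping, chosen for illustration.
    Returns a renormalized distribution over the kept token indices.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    b = entropy / max_entropy if max_entropy > 0 else 0.0
    b = min(1.0, max(0.0, b))  # guard against floating-point drift
    cutoff = max(probs) * (1.0 - b)
    kept = [i for i, p in enumerate(probs) if p >= cutoff]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Low-entropy (confident) distribution: the band collapses to the top token.
peaked = entropy_band_filter([0.97, 0.01, 0.01, 0.01])

# High-entropy (uniform) distribution: the band admits every token.
flat = entropy_band_filter([0.25, 0.25, 0.25, 0.25])

print(sorted(peaked), sorted(flat))
```

Unlike a fixed Top-k or Top-p threshold, the candidate set here shrinks for confident predictions and grows for uncertain ones, which is the trade-off the abstract describes.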

74. 【2603.14563】Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

链接https://arxiv.org/abs/2603.14563

作者:Deepon Halder,Angira Mukherjee

类目:Computation and Language (cs.CL)

关键词:domain-appropriate training corpora, robust language models, scarcity of high-quality, development of robust, frequently bottlenecked

备注

点击查看摘要

Abstract:The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.

75. 【2603.14525】MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Link: https://arxiv.org/abs/2603.14525

Authors: Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, Adam Wierzbicki

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: public discourse, intentional creation, creation and spread, poses a significant, significant threat

Comments: Paper accepted to EACL 2026 Main Conference

Abstract:The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning approach for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.

76. 【2603.14501】CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language

Link: https://arxiv.org/abs/2603.14501

Authors: Junhang Cheng, Fang Liu, Jia Li, Chengru Wu, Nanxiang Jiang, Li Zhang

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, high-resource programming languages, low-resource programming languages, programming languages

Comments: 26 pages, 20 figures

Abstract:Large Language Models excel in high-resource programming languages but struggle with low-resource ones. Existing research related to low-resource programming languages primarily focuses on Domain-Specific Languages (DSLs), leaving general-purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark comprises 248 high-quality samples manually translated from HumanEval and ClassEval, covering both Text-to-Code and Code-to-Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost. The Agent setting achieves state-of-the-art accuracy but incurs high token consumption. Furthermore, we observe that Code-to-Code translation often underperforms Text-to-Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low-resource programming languages. Our code and data are available at this https URL.

77. 【2603.14493】Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

Link: https://arxiv.org/abs/2603.14493

Authors: He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: multimodal large language, large language models, mitigate catastrophic forgetting, simple adjustments, fine-tuning recipes

Abstract:This paper demonstrates that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.

78. 【2603.14486】Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Link: https://arxiv.org/abs/2603.14486

Authors: Aditya Sharan, Sriram Hebbale, Dhruv Kumar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Training large language, large language models, Training large, high-quality data, scarcity of verifiable

Abstract:Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
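
To make the Formula-as-Code idea concrete, here is a minimal sketch of a problem whose solution is an executable program that can be checked automatically. The projectile problem, the function names (`time_of_flight`, `horizontal_range`, `solve`, `verify`), and the closed-form cross-check are illustrative assumptions; they are not taken from IPG or from ClassicalMechanicsV1.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)

def time_of_flight(v0, angle_deg):
    """Formula 1: t = 2 * v0 * sin(theta) / g, launch and landing at equal height."""
    return 2.0 * v0 * math.sin(math.radians(angle_deg)) / G

def horizontal_range(v0, angle_deg):
    """Formula 2: R = v0 * cos(theta) * t, reusing Formula 1."""
    return v0 * math.cos(math.radians(angle_deg)) * time_of_flight(v0, angle_deg)

def solve(problem):
    """Executable solution for one hypothetical problem instance."""
    return horizontal_range(problem["v0"], problem["angle_deg"])

def verify(problem, tol=1e-9):
    """Cross-check against the closed form R = v0^2 * sin(2 * theta) / g."""
    v0 = problem["v0"]
    theta = math.radians(problem["angle_deg"])
    closed_form = v0 ** 2 * math.sin(2.0 * theta) / G
    return math.isclose(solve(problem), closed_form, rel_tol=tol)

# Hypothetical instance: launch speed 20 m/s at 30 degrees
problem = {"v0": 20.0, "angle_deg": 30.0}
print(round(solve(problem), 2), verify(problem))
```

This toy instance chains two formulas, so in the abstract's terms its formula count would be 2; a generator could vary the numeric parameters freely while the program keeps the answer consistent.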

79. 【2603.14473】AI Can Learn Scientific Taste

Link: https://arxiv.org/abs/2603.14473

Authors: Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu

Categories: Computation and Language (cs.CL)

Keywords: Great scientists, scientific, scientific taste, call scientific taste, Scientific Judge

Comments: 44 pages, 4 figures

Abstract:Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.

80. 【2603.14463】An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

Link: https://arxiv.org/abs/2603.14463

Authors: Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin

Categories: Computation and Language (cs.CL)

Keywords: Adapting Large Language, Large Language Models, Adapting Large, Language Models, Large Language

Comments: 21 pages, 12 figures, 17 tables

Abstract:Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

81. 【2603.14458】Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

Link: https://arxiv.org/abs/2603.14458

Authors: Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Fact-seeking question answering, large language models, remains unreliable, conflicting information, question answering

Abstract:Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.

82. 【2603.14456】PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Link: https://arxiv.org/abs/2603.14456

Authors: Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Persian poses unique, poses unique audio, Persian Audio Reasoning, Speech Assessment Benchmark, unique audio understanding

Comments: Submitted to Interspeech 2026

Abstract:Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at this https URL

83. 【2603.14443】Echoes Across Centuries: Phonetic Signatures of Persian Poets

Link: https://arxiv.org/abs/2603.14443

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Categories: Computation and Language (cs.CL)

Keywords: Persian, examines phonetic texture, phonetic, Persian poetry, study examines phonetic

Abstract:This study examines phonetic texture in Persian poetry as a literary-historical phenomenon rather than a by-product of meter or a feature used only for classification. The analysis draws on a large corpus of 1,116,306 mesras from 31,988 poems written by 83 poets, restricted to five major classical meters to enable controlled comparison. Each line is converted into a grapheme-to-phoneme representation and analyzed using six phonetic metrics: hardness, sonority, sibilance, vowel ratio, phoneme entropy, and consonant-cluster ratio. Statistical models estimate poet-level differences while controlling for meter, poetic form, and line length. The results show that although meter and form explain a substantial portion of phonetic variation, they do not eliminate systematic differences between poets. Persian poetic sound therefore appears as conditioned variation within shared prosodic structures rather than as either purely individual style or simple metrical residue. A multidimensional stylistic map reveals several recurrent phonetic profiles, including high-sonority lyric styles, hardness-driven rhetorical or epic styles, sibilant mystical contours, and high-entropy complex textures. Historical analysis indicates that phonetic distributions shift across centuries, reflecting changes in genre prominence, literary institutions, and performance contexts rather than abrupt stylistic breaks. The study establishes a corpus-scale framework for phonetic analysis in Persian poetry and demonstrates how computational phonetics can contribute to literary-historical interpretation while remaining attentive to the formal structures that shape Persian verse.

Submission history: v1 submitted by Kourosh Shahnazari on Sun, 15 Mar 2026 15:41:21 UTC (7,812 KB)

84. 【2603.14430】Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Link: https://arxiv.org/abs/2603.14430

Authors: Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei, Ziliang Qiu, Renfen Hu, Jiamou Liu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, narrative

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.

85. 【2603.14417】Questionnaire Responses Do not Capture the Safety of AI Agents

Link: https://arxiv.org/abs/2603.14417

Authors: Max Hellrigel-Holderbaum, Edward James Young

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: LLMs, systems, advance in capabilities, capabilities

Comments: 31 pages, 11 pages main text

Abstract:As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs' engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real-world contexts as they lack construct validity. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.

86. 【2603.14410】BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation

Link: https://arxiv.org/abs/2603.14410

Authors: Zhaoyi Li, Xu Zhang, Xiaojun Wan

Categories: Computation and Language (cs.CL)

Keywords: Generating long-form linear, linear outlining approaches, long-form linear fiction, large language models, open-ended themes remains

Comments: 15 pages, 3 figures

Abstract:Generating long-form linear fiction from open-ended themes remains a major challenge for large language models, which frequently fail to guarantee global structure and narrative diversity when using premise-based or linear outlining approaches. We present BiT-MCTS, a theme-driven framework that operationalizes a "climax-first, bidirectional expansion" strategy motivated by Freytag's Pyramid. Given a theme, our method extracts a core dramatic conflict and generates an explicit climax, then employs a bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution) to produce a structured outline. A final generation stage realizes a complete narrative from the refined outline. We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones. Results show that BiT-MCTS improves narrative coherence, plot structure, and thematic depth relative to strong baselines, while enabling substantially longer, more coherent stories according to automatic metrics and human judgments.

87. 【2603.14400】Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Link: https://arxiv.org/abs/2603.14400

Authors: Andrew Katz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: minimal pairs paradigm, evaluating linguistic knowledge, comparing model probabilities, syntactic phenomena, minimal pairs

Comments: 34 pages, 11 figures

Abstract:The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
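
The surprisal-curve idea can be sketched in a few lines: given the log probability a model assigns to each rating token, renormalize over the scale, take the negative log probability as the surprisal, and compute the entropy of the resulting distribution. The log probabilities below are made-up numbers and the function names are illustrative assumptions, not the paper's code.

```python
import math

def surprisal_curve(logprobs):
    """Map each rating to its surprisal, -log p, after renormalizing over the scale."""
    # log-sum-exp over the rating tokens only, for numerical stability
    m = max(logprobs.values())
    log_z = m + math.log(sum(math.exp(lp - m) for lp in logprobs.values()))
    return {rating: -(lp - log_z) for rating, lp in logprobs.items()}

def entropy(curve):
    """Shannon entropy (nats) of the renormalized rating distribution."""
    # p = exp(-surprisal), so each term p * (-log p) equals exp(-s) * s
    return sum(math.exp(-s) * s for s in curve.values())

# Made-up log probabilities for the rating tokens 1..5
logprobs = {1: -6.0, 2: -4.0, 3: -2.0, 4: -0.3, 5: -2.5}
curve = surprisal_curve(logprobs)
preferred = min(curve, key=curve.get)  # rating with the lowest surprisal
print(preferred, round(entropy(curve), 3))
```

The minimum of the curve gives the preferred rating, and a flatter curve yields higher entropy, which is the signal the abstract uses to separate ambiguous items from easier ones.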

88. 【2603.14355】Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Link: https://arxiv.org/abs/2603.14355

Authors: Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty

Categories: Computation and Language (cs.CL)

Keywords: large language models, language models, tuning through supervised, supervised fine-tuning, fine-tuning and reinforcement

Abstract:Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.

89. 【2603.14347】Motivation in Large Language Models

Link: https://arxiv.org/abs/2603.14347

Authors: Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: shaping decisions, central driver, Motivation, human psychology, human

Comments: Preprint. Under review

Abstract:Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs "report" varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.

90. 【2603.14326】ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Link: https://arxiv.org/abs/2603.14326

Authors: Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, show promising performance, Multimodal Large, Large Language

Comments: Preprint. 9 pages for main text, 2 pages for references, 19 pages for supplementary materials (appendix)

Abstract:While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step-by-step reasoning or just rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning-centric medical AI. The code and data are available at this https URL.

91. 【2603.14313】Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models

Link: https://arxiv.org/abs/2603.14313

Authors: Yixuan Tang, Yi Yang

Categories: Computation and Language (cs.CL)

Keywords: Federal Open Market, Open Market Committee, Federal Open, move global financial, global financial markets

Abstract:Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish-dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish-dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish-dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.
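
The abstract states only that the delta-consistency objective encourages changes in absolute scores to align with the relative shifts; it does not give the loss. One plausible reading, sketched here under that assumption, is a mean squared gap between consecutive score differences and the predicted shifts. The function name, the squared-error form, and the example numbers are all hypothetical.

```python
def delta_consistency_loss(abs_scores, rel_shifts):
    """Mean squared gap between score differences and predicted shifts (assumed form).

    abs_scores: absolute stance score per meeting, in chronological order.
    rel_shifts: predicted shift between consecutive meetings,
                so len(rel_shifts) == len(abs_scores) - 1.
    """
    if len(rel_shifts) != len(abs_scores) - 1:
        raise ValueError("need exactly one shift per consecutive pair")
    if not rel_shifts:
        return 0.0  # a single meeting has nothing to be consistent with
    gaps = [
        (abs_scores[t + 1] - abs_scores[t]) - rel_shifts[t]
        for t in range(len(rel_shifts))
    ]
    return sum(g * g for g in gaps) / len(gaps)

scores = [0.10, 0.40, 0.35]    # hypothetical scores, positive = more hawkish
consistent = [0.30, -0.05]     # shifts that match the score differences
inconsistent = [-0.30, 0.05]   # shifts with the opposite sign
print(delta_consistency_loss(scores, consistent))
print(delta_consistency_loss(scores, inconsistent))
```

Under this form the loss is near zero when the shift predictions agree with the score trajectory and grows when they disagree, which is how consecutive meetings can act as self-supervision without manual labels.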

92. 【2603.14303】SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging

Link: https://arxiv.org/abs/2603.14303

Authors: Shunlong Wu, Hai Lin, Shaoshen Chen, Tingwei Lu, Yongqin Zeng, Shaoxiong Zhan, Hai-Tao Zheng, Hong-Gee Kim

Categories: Computation and Language (cs.CL)

Keywords: methods generally operate, compression methods generally, methods generally, generally operate, operate on discrete

Abstract:Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. However, such approaches often lead to semantic fragmentation, where linguistically coherent units are disrupted, causing irreversible information loss and degradation in model performance. To address this, we introduce SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchical nature of language. Specifically, we first partition the cache into semantically coherent chunks by delimiters, which are natural semantic boundaries. Within each chunk, we introduce a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm to group tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to the original model.

93. 【2603.14265】MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Link: https://arxiv.org/abs/2603.14265

Authors: Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng, Jing Li, Harry Qin, Vivian Hui

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Recent advances, Retrieval-Augmented Generation, enabled large language, Health Insurance Portability, advances in Retrieval-Augmented

Comments: 17 pages, 5 figures

Abstract:Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare focus heavily on accuracy and ignore such privacy issues, despite strict regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.

94. 【2603.14257】Automatic Inter-document Multi-hop Scientific QA Generation

Link: https://arxiv.org/abs/2603.14257

Authors: Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song

Categories: Computation and Language (cs.CL)

Keywords: question generation studies, Existing automatic scientific, scientific question generation, inter-document reasoning crucial, overlooking the inter-document

Comments: 14 pages, 5 figures, 8 tables. Accepted to the 2026 International Conference on Language Resources and Evaluation (LREC 2026)

Abstract:Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.

95. 【2603.14251】Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

Link: https://arxiv.org/abs/2603.14251

Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Reasoning Language, demonstrate impressive capabilities, Reasoning Language Models, Language Models, Large Reasoning

Abstract:Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies have been proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLM performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.
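
The abstract does not define the path deviation index, so the following is only a hedged sketch of the general idea of monitoring high-entropy tokens during decoding. It assumes a sliding window over per-step next-token distributions; the thresholds, the window size, and the names `token_entropy` and `first_exit_step` are hypothetical and are not taken from the paper.

```python
import math
from collections import deque

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_exit_step(step_distributions, entropy_threshold=1.0, window=4, max_ratio=0.5):
    """Return the first decoding step at which more than `max_ratio` of the
    last `window` steps were high-entropy, or None if that never happens."""
    recent = deque(maxlen=window)  # rolling record of high-entropy flags
    for step, probs in enumerate(step_distributions):
        recent.append(token_entropy(probs) > entropy_threshold)
        if len(recent) == window and sum(recent) / window > max_ratio:
            return step
    return None
```

A monitor of this kind reads only the distributions the model already produces while reasoning, so it needs neither a proxy model nor separate probing answers.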

96. 【2603.14248】Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Link: https://arxiv.org/abs/2603.14248

Authors: Mohamed Aghzal, Gregory J. Stein, Ziyu Yao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language model, Large language, long-horizon tasks, Large, language model

Abstract:Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

97. 【2603.14239】QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Link: https://arxiv.org/abs/2603.14239

Authors: Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Keywords: SystemVerilog Assertions, hardware verification, crucial for hardware, Assertions, SVAs

Comments: Accepted by DAC 2026. Code: [this https URL](https://github.com/wyt2000/CodeV-SVA); Model: [this https URL](https://huggingface.co/wyt2000/CodeV-SVA-14B)

Abstract:SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.

98. 【2603.14217】Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

Link: https://arxiv.org/abs/2603.14217

Authors: Tianyi Zhang, David Traum

Categories: Computation and Language (cs.CL)

Keywords: joint activity sustained, linguistic theory, sustained by coherence, shared understanding, cognitive science

Comments: Accepted to LREC 2026

Abstract:In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.

99. 【2603.14210】Vavanagi: a Community-run Platform for Documentation of the Hula Language in Papua New Guinea

Link: https://arxiv.org/abs/2603.14210

Authors: Bri Olewale, Raphael Merx, Ekaterina Vylomova

Categories: Computation and Language (cs.CL)

Keywords: Papua New Guinea, Guinea with approximately, community-run platform, Austronesian language, Vula'a

Abstract:We present Vavanagi, a community-run platform for Hula (Vula'a), an Austronesian language of Papua New Guinea with approximately 10,000 speakers. Vavanagi supports crowdsourced English-Hula text translation and voice recording, with elder-led review and community-governed data infrastructure. To date, 77 translators and 4 reviewers have produced over 12k parallel sentence pairs covering 9k unique Hula words. We also propose a multi-level framework for measuring community involvement, from consultation to fully community-initiated and governed projects. We position Vavanagi at Level 5: initiative, design, implementation, and data governance all sit within the Hula community, making it, to our knowledge, the first community-led language technology initiative for a language of this size. Vavanagi shows how language technology can bridge village-based and urban members, connect generations, and support cultural heritage on the community's own terms.

100. 【2603.14183】Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification

Link: https://arxiv.org/abs/2603.14183

Authors: Fariba Afrin Irany, Sampson Akwafuo

Categories: Computation and Language (cs.CL)

Keywords: patient cohort discovery, electronic health record, generated large volumes, clinical decision support, unstructured clinical narratives

Abstract:The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from these free-text documents remains challenging because clinical language is highly specialized, labeled datasets are limited, and full fine-tuning of large pretrained language models can require substantial computational resources. Efficient adaptation strategies are therefore essential for practical clinical natural language processing applications. This study proposes a parameter-efficient selective fine-tuning framework for adapting GPT-2 to clinical text classification tasks. Instead of updating the entire pretrained model, the majority of network parameters are frozen, and only the final Transformer block, the final layer normalization module, and a lightweight classification head are updated during training. This design substantially reduces the number of trainable parameters while preserving the contextual representation capabilities learned during pretraining. The proposed approach is evaluated using radiology reports from the MIMIC-IV-Note dataset with automatically derived CheXpert-style labels. Experiments on 50,000 radiology reports demonstrate that selective fine-tuning achieves approximately 91% classification accuracy while updating fewer than 6% of the model parameters. Comparative experiments with head-only training and full-model fine-tuning show that the proposed method provides a favorable balance between predictive performance and computational efficiency. These results indicate that selective fine-tuning offers an efficient and scalable framework for clinical text classification.
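
The freezing scheme in this abstract (train only the final Transformer block, the final layer normalization, and the classification head) can be illustrated with a small name-based selector. This is a sketch, not the study's code: it assumes parameter names laid out as in Hugging Face's `GPT2ForSequenceClassification` (`transformer.h.<i>.`, `transformer.ln_f.`, `score.`), and the helpers `is_trainable` and `trainable_fraction` are hypothetical.

```python
def is_trainable(param_name, num_blocks):
    """True for parameters in the last Transformer block, the final
    layer norm, or the classification head; everything else stays frozen."""
    last_block_prefix = f"transformer.h.{num_blocks - 1}."
    return (
        param_name.startswith(last_block_prefix)
        or param_name.startswith("transformer.ln_f.")
        or param_name.startswith("score.")
    )

def trainable_fraction(param_sizes, num_blocks):
    """Share of parameters left trainable, given a {name: element_count} map."""
    total = sum(param_sizes.values())
    trainable = sum(
        count for name, count in param_sizes.items() if is_trainable(name, num_blocks)
    )
    return trainable / total

# With PyTorch, the same predicate could drive the freeze (assumed usage):
#   for name, param in model.named_parameters():
#       param.requires_grad = is_trainable(name, num_blocks)
```

The trailing dot in the block prefix matters: without it, a prefix for block 1 would also match blocks 10 and 11.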

101. 【2603.14170】Citation-Enforced RAG for Fiscal Document Intelligence: Cited, Explainable Knowledge Retrieval in Tax Compliance

Link: https://arxiv.org/abs/2603.14170

Authors: Akhil Chandra Shanivendra

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: public-sector financial agencies, financial agencies rely, including tax forms, semi-structured fiscal documents, jurisdiction-specific guidance

Comments: 22 pages, 3 figures. Applied AI systems paper focused on citation-enforced RAG and abstention for fiscal document intelligence

Abstract:Tax authorities and public-sector financial agencies rely on large volumes of unstructured and semi-structured fiscal documents - including tax forms, instructions, publications, and jurisdiction-specific guidance - to support compliance analysis and audit workflows. While recent advances in generative AI and retrieval-augmented generation (RAG) have shown promise for document-centric question answering, existing approaches often lack the transparency, citation fidelity, and conservative behaviour required in high-stakes regulatory domains. This paper presents a multimodal, citation-enforced RAG framework for fiscal document intelligence that prioritises explainability and auditability. The framework adopts a source-first ingestion strategy, preserves page-level provenance, enforces citations during generation, and supports abstention when evidence is insufficient. Evaluation on real IRS and state tax documents demonstrates improved citation fidelity, reduced hallucination, and analyst-usable explanations, illustrating a pathway toward trustworthy AI for tax compliance.
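
As a rough illustration of what "enforces citations" and "supports abstention" might look like at the output stage, the sketch below accepts an answer only when every sentence cites a page that was actually retrieved. The `[doc:pN]` citation format, the `CITATION` pattern, and the function `enforce_citations` are assumptions made for this example; the abstract does not specify the framework's mechanism.

```python
import re

# Hypothetical inline citation format: [document_id:p<page_number>]
CITATION = re.compile(r"\[([^\[\]:]+):p(\d+)\]")

def enforce_citations(answer_sentences, retrieved_pages):
    """Return the joined answer if every sentence cites at least one
    retrieved (doc, page) pair; return None to signal abstention otherwise."""
    for sentence in answer_sentences:
        cited = {(doc, int(page)) for doc, page in CITATION.findall(sentence)}
        if not cited or not cited <= retrieved_pages:
            return None  # uncited claim, or citation to a page that was not retrieved
    return " ".join(answer_sentences)
```

Returning `None` rather than a partial answer reflects the conservative behaviour the abstract asks for in regulatory settings.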

102. 【2603.14145】MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Link: https://arxiv.org/abs/2603.14145

Authors: Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, shown strong performance, Multimodal Large

Comments: Project Page: [this https URL](https://huggingface.co/datasets/nvidia/MMOU)

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.

103. 【2603.14130】The GELATO Dataset for Legislative NER

Link: https://arxiv.org/abs/2603.14130

Authors: Matthew Flynn, Timothy Obiso, Sam Newman

Categories: Computation and Language (cs.CL)

Keywords: paper introduces GELATO, entity recognition ontology, recognition ontology designed, House and Senate, two-level named entity

Comments: Accepted at LREC 2026

Abstract:This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction. We then use LLMs with optimized prompts to complete the second-level prediction. The strong performance of RoBERTa and relatively weak performance of BERT models, as well as the application of LLMs as second-level predictors, support future research in legislative NER or downstream tasks using these model combinations as extraction tools.

104. 【2603.14111】OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

Link: https://arxiv.org/abs/2603.14111

Authors: Hannah Liu, Muxin Tian, Iqra Ali, Haonan Gao, Qiaoyiwen Wu, Blair Yang, Uthayasanker Thayasivam, En-Shiun Annie Lee, Pakawat Nakwijit, Surangika Ranathunga, Ravi Shekhar

Categories: Computation and Language (cs.CL)

Keywords: make complex text, reducing linguistic complexity, aims to make, make complex, complex text

Comments: Accepted at LREC 2026

Abstract:Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at this https URL.

105. 【2603.14087】Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Link: https://arxiv.org/abs/2603.14087

Authors: Mark Rofin, Jalal Naghiyev, Michael Hahn

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: compute abstract features, Trained Transformers, compute abstract, shown to compute, redundant for predicting

Comments: ICLR 2026

Abstract:Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.

106. 【2603.14078】CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification

Link: https://arxiv.org/abs/2603.14078

Authors: Menna Elgabry, Ali Hamdi, Khaled Shaban

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Textual Emotion Classification, difficult NLP tasks, NLP tasks, difficult NLP, Textual Emotion

Abstract:Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State-of-the-art approaches rely on large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger scale or more complex models are necessary for improved performance. To improve logical consistency, we introduce CMHL, a novel single-model architecture that explicitly models the logical structure of emotions through three key innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity, (2) psychologically-grounded auxiliary supervision derived from Russell's circumplex model, and (3) a novel contrastive contradiction loss that enforces emotional consistency by penalizing mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger). With just 125M parameters, our model outperforms 56x larger LLMs and sLM ensembles with a new state-of-the-art F1 score of 93.75% compared to (86.13%-93.2%) on the dair-ai Emotion dataset. We further show cross-domain generalization on the Reddit Suicide Watch and Mental Health Collection dataset (SWMH), outperforming domain-specific models like MentalBERT and MentalRoBERTa with an F1 score of 72.50% compared to (68.16%-72.16%) and a recall of 73.30% compared to (67.05%-70.89%), which translates to enhanced sensitivity for detecting mental health distress. Our work establishes that architectural intelligence (not parameter count) drives progress in TEC. By embedding psychological priors and explicit consistency constraints, a well-designed single model can outperform both massive LLMs and complex ensembles, offering an efficient, interpretable, and clinically-relevant paradigm for affective computing.
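
The abstract describes a loss that penalizes mutually incompatible predictions but gives no formula. One simple stand-in, shown here purely as an assumption, is the sum of products of the predicted confidences for each incompatible pair; the function name `contradiction_penalty` and the pair list are hypothetical and may differ from CMHL's actual contrastive formulation.

```python
def contradiction_penalty(probs, incompatible_pairs):
    """Sum of p(a) * p(b) over emotion pairs assumed to be incompatible.
    The term is large only when both emotions are predicted with high confidence."""
    return sum(probs[a] * probs[b] for a, b in incompatible_pairs)

# Hypothetical per-emotion confidences (independent scores, not a softmax)
example_probs = {"joy": 0.9, "anger": 0.8, "sadness": 0.1}
example_pairs = [("joy", "anger"), ("joy", "sadness")]
penalty = contradiction_penalty(example_probs, example_pairs)
```

In a training setup, a term like this would be added to the classification loss with a weighting coefficient, which is likewise an assumption here.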

107. 【2603.14053】NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

Link: https://arxiv.org/abs/2603.14053

Authors: Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya, Nischal Karki, Rupak Tiwari, Rishikesh Kumar Sharma, Jenny Poudel, Bal Krishna Bal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Modern Translation Systems, Systems heavily rely, Translation Systems heavily, Systems heavily, South Asian languages

Comments: Accepted in LREC 2026

Abstract:Modern translation systems rely heavily on high-quality, large parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out using various multilingual pre-trained models: mBART, M2M-100, NLLB-200, and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores of 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).

108. 【2603.14045】The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA

Link: https://arxiv.org/abs/2603.14045

Authors: Yasaman Zarinkia, Venkatesh Srinivasan, Alex Thomo

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: systems achieve strong, achieve strong multi-hop, guarantee strong answers, Graph-RAG systems achieve, knowledge graphs

Comments: 11 pages, 2 figures, 9 tables; under review

Abstract:Graph-RAG systems achieve strong multi-hop question answering by indexing documents into knowledge graphs, but strong retrieval does not guarantee strong answers. Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and 73% to 84% of errors are reasoning failures. We propose two augmentations: (i) SPARQL chain-of-thought prompting, which decomposes questions into triple-pattern queries aligned with the entity-relationship context, and (ii) graph-walk compression, which compresses the context by ~60% via knowledge-graph traversal with no LLM calls. SPARQL CoT improves accuracy by +2 to +14 pp; graph-walk compression adds +6 pp on average when paired with structured prompting on smaller models. Surprisingly, we show that, with question-type routing, a fully augmented budget open-weight Llama-8B model matches or exceeds the unaugmented Llama-70B baseline on all three benchmarks at ~12x lower cost. A replication on LightRAG confirms that our augmentations transfer across Graph-RAG systems.

109. 【2603.14035】Probing neural audio codecs for distinctions among English nuclear tunes

Link: https://arxiv.org/abs/2603.14035

Authors: Juan Pablo Vigneaux, Jennifer Cole

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: spoken dialogue models, spoken dialogue, dialogue models, Défossez, Schalkwyk

Comments: 5 pages; 1 table; 3 figures. Accepted as conference paper at Speech Prosody 2026

Abstract:State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and perception (TATA: 0.45). Greater accuracy (TATAs: 0.74-0.89) is attained for binary distinctions between classes of rising vs. falling tunes, respectively used for questions and assertions. Information about tunes is spread among all codebooks, which calls into question a distinction between 'semantic' and 'acoustic' codebooks found in the literature. Accuracies improve with nonlinear probes, but discrimination among the five clusters remains far from human performance, suggesting a fundamental limitation of current codecs.

110. 【2603.14027】SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

Link: https://arxiv.org/abs/2603.14027

Authors: Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou

Categories: Computation and Language (cs.CL)

Keywords: avoid answering questions, answering questions directly, Natural Language Processing, appearance of responsiveness, speakers often avoid

Abstract:Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.

111. 【2603.14006】Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs

Link: https://arxiv.org/abs/2603.14006

Authors: Hang Gao, Dimitris N. Metaxas

Categories: Computation and Language (cs.CL)

Keywords: converting unstructured corpora, enable multi-hop reasoning, Similarity Enhanced Search, increasingly adopted, adopted for converting

Abstract:GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multi-hop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.

112. 【2603.13985】Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

Link: https://arxiv.org/abs/2603.13985

Authors: Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Pre-trained Large Language, Large Language Model, exhibits broad capabilities, reliable reasoning generally, reasoning generally depends

Comments: 26 pages

Abstract:Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, higher accuracy and more reliable reasoning generally depend on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.

113. 【2603.13972】FLUX: Data Worth Training On

Link: https://arxiv.org/abs/2603.13972

Authors: Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: achieve massive scale, longer limited, inability of existing, massive scale, existing preprocessing pipelines

Abstract:Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the same aggregate score as a model trained on DCLM data using only 39B tokens, resulting in a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (+25% retention). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality. Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.

114. 【2603.13962】sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook

Link: https://arxiv.org/abs/2603.13962

Authors: Ibrahim Ebrar Yurt, Fabian Karl, Tejaswi Choppa, Florian Matthes

Categories: Computation and Language (cs.CL)

Keywords: electronic health records, patients access relevant, access relevant medical, relevant medical information, health records

Abstract:Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently. However, many recent approaches rely on large cloud-based models, which are difficult to deploy in clinical environments due to privacy constraints and computational requirements. In this work, we investigate how far grounded EHR question answering can be pushed when restricted to a single notebook. We participate in all four subtasks of the ArchEHR-QA 2026 shared task and evaluate several approaches designed to run on commodity hardware. All experiments are conducted locally without external APIs or cloud infrastructure. Our results show that such systems can achieve competitive performance on the shared task leaderboards. In particular, our submissions perform above average in two subtasks, and we observe that smaller models can approach the performance of much larger systems when properly configured. These findings suggest that privacy-preserving EHR QA systems running fully locally are feasible with current models and commodity hardware. The source code is available at this https URL.

115. 【2603.13950】ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

Link: https://arxiv.org/abs/2603.13950

Authors: Hussein Jawad, Nicolas J-B Brunel

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Language Model, increasingly use external, complex tasks

Abstract:Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker-controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top-k results, and push all benign tools out of the agent's context. ToolFlood uses a two-phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine-distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: this https URL
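
The second phase described in the abstract, an iterative greedy selection that maximizes coverage of the remaining queries under a cosine-distance threshold, has the shape of a classic greedy set-cover loop. The sketch below illustrates that generic loop on toy vectors. The function names, the threshold, the budget, and the data are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def greedy_cover(queries, candidates, threshold, budget):
    """Greedily pick candidates that cover the most still-uncovered queries."""
    # covers[j] holds the indices of queries within the threshold of candidate j.
    covers = [
        {i for i, q in enumerate(queries) if cosine_distance(q, c) <= threshold}
        for c in candidates
    ]
    uncovered = set(range(len(queries)))
    chosen = []
    # Stop when every query is covered or the budget is exhausted.
    while uncovered and len(chosen) < budget:
        best = max(range(len(candidates)), key=lambda j: len(covers[j] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # no remaining candidate covers anything new
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy 2-D embeddings (hypothetical values).
queries = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
candidates = [(1.0, 0.05), (0.0, 1.0), (0.7, 0.7)]
chosen, uncovered = greedy_cover(queries, candidates, threshold=0.05, budget=2)
```

Greedy selection is the usual choice for coverage objectives because each step only needs the marginal gain over what is already covered.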

116. 【2603.13933】OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

Link: https://arxiv.org/abs/2603.13933

Authors: Wenbin Hu, Huihao Jing, Haochen Shi, Changxuan Fan, Haoran Li, Yangqiu Song

Categories: Computation and Language (cs.CL)

Keywords: large language models, paramount importance, large language, safety, Ensuring the safety

Abstract:Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards, educational integrity guidelines, and protections of fundamental human rights. In total, our dataset contains 12,985 distinct rules and 106,009 associated real-world compliance cases. Our analysis confirms a strong alignment between the rules and their corresponding cases. We further conduct extensive benchmarking experiments to evaluate the safety and compliance capabilities of advanced LLMs across different model scales. Our experiments reveal several interesting findings that have great potential to offer valuable insights for future LLM safety research.

117. 【2603.13911】The Phenomenology of Hallucinations

Link: https://arxiv.org/abs/2603.13911

Authors: Valeria Ruscio, Keiran Thompson

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: language models hallucinate, language models, models hallucinate, fail to detect, failure to integrate

Abstract:We show that language models hallucinate not because they fail to detect uncertainty, but because of a failure to integrate it into output generation. Across architectures, uncertain inputs are reliably identified, occupying high-dimensional regions with 2-3$\times$ the intrinsic dimensionality of factual inputs. However, this internal signal is weakly coupled to the output layer: uncertainty migrates into low-sensitivity subspaces, becoming geometrically amplified yet functionally silent. Topological analysis shows that uncertainty representations fragment rather than converging to a unified abstention state, while gradient and Fisher probes reveal collapsing sensitivity along the uncertainty direction. Because cross-entropy training provides no attractor for abstention and uniformly rewards confident prediction, associative mechanisms amplify these fractured activations until residual coupling forces a committed output despite internal detection. Causal interventions confirm this account by restoring refusal when uncertainty is directly connected to logits.

118. 【2603.13891】Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation

Link: https://arxiv.org/abs/2603.13891

Authors: Petter Törnberg

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, moderation and hiring, ranging from academic, content moderation

Abstract:Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap $-0.774$), less indicative of an educated speaker ($-0.688$), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants. These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.

119. 【2603.13878】Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Link: https://arxiv.org/abs/2603.13878

Authors: Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan, Yaqian Li, Kaiwen Long, Xun Gong, Masayuki Ikebe, Yefeng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visual question answering, advanced medical visual, medical visual question, existing CoT rationales, reasoning process clinicians

Notes: Accepted by CVPR 2026 Finding Track

Abstract:Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: this http URL. Dataset Card: this http URL

120. 【2603.13875】GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Link: https://arxiv.org/abs/2603.13875

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: applications require conditioning, model applications require, applications require, require conditioning, conditioning on long

Abstract:Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

121. 【2603.13853】APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

Link: https://arxiv.org/abs/2603.13853

Authors: Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: leveraging external knowledge, Retrieval-augmented generation, large language models, based on large, domain applications

Abstract:Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; building on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performance across multiple benchmarks.

122. 【2603.13796】PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement

Link: https://arxiv.org/abs/2603.13796

Authors: Yongkang Guo, Zhihuan Huang, Yuqing Kong

Categories: Computation and Language (cs.CL)

Keywords: High dialogue engagement, High dialogue, crucial indicator, High, engagement

Notes: 23 pages, 4 figures. Accepted to The Web Conference 2026

Abstract:High dialogue engagement is a crucial indicator of an effective conversation. A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills. However, quantifying engagement is challenging, since it is subjective and lacks a "gold standard". This paper proposes PMIScore, an efficient unsupervised approach to quantify dialogue engagement. It uses pointwise mutual information (PMI), which is the probability of generating a response conditioned on the conversation history. Thus, PMIScore offers a clear interpretation of engagement. As directly computing PMI is intractable due to the complexity of dialogues, PMIScore learns it through a dual form of divergence. The algorithm includes generating positive and negative dialogue pairs, extracting embeddings by large language models (LLMs), and training a small neural network using a mutual information loss function. We validated PMIScore on both synthetic and real-world datasets. Our results demonstrate the effectiveness of PMIScore in PMI estimation and the reasonableness of the PMI metric itself.
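
For reference, the textbook definition of pointwise mutual information between a conversation history and a response can be written as below. The symbols $h$ (history) and $r$ (response) are notation assumed here for illustration, and the paper's exact formulation may differ.

$$
\mathrm{PMI}(h; r) = \log \frac{p(h, r)}{p(h)\,p(r)} = \log \frac{p(r \mid h)}{p(r)}
$$

Under this definition a positive value means the response is more likely given the history than it is on its own, which is the sense in which the score can be read as engagement.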

123. 【2603.13793】GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages

Link: https://arxiv.org/abs/2603.13793

Authors: Lawrence Adu Gyamfi, Paul Azunre, Stephen Edward Moore, Joel Budu, Akwasi Asare, Mich-Seth Owusu, Jonathan Ofori Asiamah

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Low resource languages, present unique challenges, Low resource, resource languages present, languages present unique

Abstract:Low resource languages present unique challenges for natural language processing due to the limited availability of digitized and well structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.

124. 【2603.13791】DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents

Link: https://arxiv.org/abs/2603.13791

Authors: Snehasis Mukhopadhyay

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, high-stakes agentic contexts, behavior in Large, Language Model, Reliable detection

Abstract:Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.

125. 【2603.13790】Greedy Information Projection for LLM Data Selection

Link: https://arxiv.org/abs/2603.13790

Authors: Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: large language model, language model fine-tuning, GIP, Greedy Information Projection, choosing training

Notes: Published as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil

Abstract:We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing quality and diversity. Optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, GIP selects small subsets that match full-data fine-tuning while using only a fraction of examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
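
The geometric view in the abstract, picking examples whose span captures as much of the query embeddings as possible, can be illustrated with a small greedy loop. The sketch below is a generic orthogonal-matching-pursuit-style selection written for illustration; the function names, the tolerance `eps`, and the toy vectors are assumptions, and the paper's actual objective and update rules may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual(x, basis):
    """Remove from x its components along an orthonormal basis (Gram-Schmidt)."""
    r = list(x)
    for b in basis:
        coeff = dot(r, b)
        r = [ri - coeff * bi for ri, bi in zip(r, b)]
    return r


def greedy_projection_select(data, queries, k, eps=1e-12):
    """Greedily pick up to k data vectors whose span best captures the queries."""
    basis, chosen = [], []
    for _ in range(k):
        best_gain, best_idx, best_dir = 0.0, None, None
        for i, x in enumerate(data):
            if i in chosen:
                continue
            r = residual(x, basis)
            norm_sq = dot(r, r)
            if norm_sq <= eps:
                continue  # x already lies in the selected span (redundant)
            # Increase in squared projected query norm if x joins the span.
            gain = sum(dot(q, r) ** 2 for q in queries) / norm_sq
            if gain > best_gain:
                best_gain, best_idx = gain, i
                best_dir = [ri / math.sqrt(norm_sq) for ri in r]
        if best_idx is None:
            break  # nothing left that adds projection mass
        chosen.append(best_idx)
        basis.append(best_dir)
    return chosen


# Toy embeddings (hypothetical): item 1 is a scaled duplicate of item 0.
data = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
queries = [(2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
selected = greedy_projection_select(data, queries, k=2)
```

In this toy run the duplicate direction contributes nothing once item 0 is selected, so the second pick goes to a different direction that the queries also care about. This is one way to see quality (alignment with queries) and diversity (new directions) emerge from a single score.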

126. 【2603.13786】Projection-Free Evolution Strategies for Continuous Prompt Search

Link: https://arxiv.org/abs/2603.13786

Authors: Yu Cai, Canxi Huang, Xiaoyu He

Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: computationally efficient alternative, conventional parameter tuning, Continuous prompt search, Continuous prompt, prompt search offers

Abstract:Continuous prompt search offers a computationally efficient alternative to conventional parameter tuning in natural language processing tasks. Nevertheless, its practical effectiveness can be significantly hindered by the black-box nature and the inherent high-dimensionality of the objective landscapes. Existing methods typically mitigate these challenges by restricting the search to a randomly projected low-dimensional subspace. However, the effectiveness and underlying motivation of the projection mechanism remain ambiguous. In this paper, we first empirically demonstrate that despite the prompt space possessing a low-dimensional structure, random projections fail to adequately capture this essential structure. Motivated by this finding, we propose a projection-free prompt search method based on evolutionary strategies. By directly optimizing in the full prompt space with an adaptation mechanism calibrated to the intrinsic dimension, our method achieves competitive search capabilities without additional computational overhead. Furthermore, to bridge the generalization gap in few-shot scenarios, we introduce a confidence-based regularization mechanism that systematically enhances the model's confidence in the target verbalizers. Experimental results on seven natural language understanding tasks from the GLUE benchmark demonstrate that our proposed approach significantly outperforms existing baselines.

127. 【2603.13777】Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction

Link: https://arxiv.org/abs/2603.13777

Authors: Shidong He, Haoyu Wang, Wenjie Luo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Aspect-based sentiment analysis, supports product analytics, extracts aspect-level sentiment, fine-grained opinion mining, aspect-level sentiment signals

Notes: 4 figures, 3 tables

Abstract:Aspect-based sentiment analysis (ABSA) extracts aspect-level sentiment signals from user-generated text, supports product analytics, experience monitoring, and public-opinion tracking, and is central to fine-grained opinion mining. A key challenge in ABSA is aspect sentiment quad prediction (ASQP), which requires identifying four elements: the aspect term, the aspect category, the opinion term, and the sentiment polarity. However, existing studies usually linearize the unordered quad set into a fixed-order template and decode it left-to-right. With teacher forcing training, the resulting training-inference mismatch (exposure bias) lets early prefix errors propagate to later elements. The linearization order determines which elements appear earlier in the prefix, so this propagation becomes order-sensitive and is hard to repair in a single pass. To address this, we propose a method, Generate-then-Correct (G2C): a generator drafts quads and a corrector performs a single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns. On the Rest15 and Rest16 datasets, G2C outperforms strong baseline models.

128. 【2603.13773】LiveWeb-IE: A Benchmark For Online Web Information Extraction

Link: https://arxiv.org/abs/2603.13773

Authors: Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang, ChaeHun Park, Jaegul Choo

Categories: Computation and Language (cs.CL)

Keywords: offering high utility, automatically extracting data, WIE systems, offering high, task of automatically

Notes: ICLR 2026

Abstract:Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.

129. 【2603.13768】Causal Tracing of Audio-Text Fusion in Large Audio Language Models

Link: https://arxiv.org/abs/2603.13768

Authors: Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: integrate acoustic features, context remains unclear, large audio language, textual context remains, remains unclear

Notes: Submitted to Interspeech 2026

Abstract:Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.

130. 【2603.13765】Knowledge Distillation for Large Language Models

Link: https://arxiv.org/abs/2603.13765

Authors: Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: compressing large language, large language models, knowledge distillation, propose a resource-efficient, resource-efficient framework

Notes: Code and data are available at: [this https URL](https://github.com/AlejandroParedesLT/knowledge_distillLLM)

Abstract:We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
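
The abstract does not spell out the distillation objective. As a rough illustration only, the sketch below shows the standard temperature-softened KL distillation loss, which is an assumption here and not necessarily what the authors use. The names `softmax` and `kd_loss` and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. This is the textbook formulation,
    assumed for illustration; the paper's exact loss is not given above.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]
    student = [0.0, 1.0, 2.0]
    print(kd_loss(teacher, teacher))  # identical distributions give zero loss
    print(kd_loss(teacher, student))  # mismatched distributions give a positive loss
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative preferences among low-probability tokens.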

131. 【2603.13725】Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality

Link: https://arxiv.org/abs/2603.13725

Authors: Taiqiang Wu, Yuxin Cheng, Chenchen Ding, Runming Yang, Xincheng Feng, Wenyong Zhou, Zhengwu Liu, Ngai Wong

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, Memristor-based analog, superior energy efficiency

Comments: 7 figures, 3 tables

Abstract:Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.

132. 【2603.13696】Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models

Link: https://arxiv.org/abs/2603.13696

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: text-only language models, language models trained, systematic evaluation, evaluation of mutual, bias to map

Comments: 13 pages, 4 figures, 4 tables

Abstract:We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech. We operationalise ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts decreased probability of the labelled noun at a subsequent completion position. Three pilot findings motivate a pre-registered scale-sensitivity experiment: (1) a masked language model (BabyBERTa) is entirely insensitive to multi-sentence referential context; (2) autoregressive models show robust repetition priming -- the opposite of ME -- when familiar nouns are re-labelled; and (3) a novel context-dependence diagnostic reveals that apparent ME-like patterns with nonce tokens are fully explained by embedding similarity, not referential disambiguation. In the confirmatory experiment, we train 45 GPT-2-architecture models (2.9M, 8.9M, and 33.5M parameters; 5, 10, and 20 epochs on AO-CHILDES; 5 seeds each) and evaluate on a pre-registered ME battery. Anti-ME repetition priming is significant in all 9 cells (85-100% of items; all p < 2.4 x 10^-13). Priming attenuates with improved language modelling (Spearman rho = -0.533, p = 0.0002) but never crosses zero across a 3.8x perplexity range. The context-dependence diagnostic replicates in all 9 cells, and dose-response priming increases with repetitions in 8/9 cells (all trend p < 0.002). These findings indicate that distributional learning on child-directed speech produces repetition-based reference tracking rather than lexical exclusivity. We connect this to the grounded cognition literature and argue that referential grounding may be a necessary ingredient for ME -- an empirical claim about required input structure, not a nativist one.

133. 【2603.13691】QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Link: https://arxiv.org/abs/2603.13691

Authors: Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang, Yuxuan Jiang, Tingting Yu, Yunqing Hong, Jiayi Liu, Rianzhe Huang, Shuxin Zhao, Haiping Hu, Wen Shang, Jian Xu, Guanjun Jiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, standardized medical exams, excel on standardized, Language Models

Abstract:While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a massive dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading. Experimental results demonstrate that the generated rubrics achieve a 91.8% concordance rate with clinical expert blind audits, establishing highly dependable medical reliability. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, while its framework inherently supports dynamic knowledge updates to prevent benchmark obsolescence.

134. 【2603.13683】Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation

Link: https://arxiv.org/abs/2603.13683

Authors: Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: producing toxic outputs, debiased LLMs perform, producing toxic, toxic outputs, debiased LLMs

Comments: This paper has been submitted to ACL2026 main conference

Abstract:Although debiased LLMs perform well on known bias patterns, they often fail to generalize to unfamiliar bias prompts, producing toxic outputs. We first validate that such high-bias prompts constitute a distribution shift via OOD detection, and show static models degrade under this shift. To adapt on-the-fly, we propose CAP-TTA, a test-time adaptation framework that performs context-aware LoRA updates only when the bias-risk trigger exceeds a threshold, using a precomputed diagonal preconditioner for fast and stable updates. Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative fluency over the SOTA debiasing baseline while maintaining comparable debiasing effectiveness.

135. 【2603.13655】Privacy Preserving Topic-wise Sentiment Analysis of the Iran Israel USA Conflict Using Federated Transformer Models

Link: https://arxiv.org/abs/2603.13655

Authors: Md Saiful Islam, Tanjim Taharat Aurpa, Sharad Hasan, Farzana Akter

Categories: Computation and Language (cs.CL)

Keywords: Iran Israel USA, Israel USA conflict, Iran Israel, Israel USA, social media platforms

Abstract:The recent escalation of the Iran Israel USA conflict in 2026 has triggered widespread global discussions across social media platforms. As people increasingly use these platforms for expressing opinions, analyzing public sentiment from these discussions can provide valuable insights into global public perception. This study aims to analyze global public sentiment regarding the Iran Israel USA conflict by mining user-generated comments from YouTube news channels. The work contributes to public opinion analysis by introducing a privacy-preserving framework that combines topic-wise sentiment analysis with modern deep learning techniques and Federated Learning. To achieve this, approximately 19,000 YouTube comments were collected from major international news channels and preprocessed to remove noise and normalize text. Sentiment labels were initially generated using the VADER sentiment analyzer and later validated through manual inspection to improve reliability. Latent Dirichlet Allocation (LDA) was applied to identify key discussion topics related to the conflict. Several transformer-based models, including BERT, RoBERTa, XLNet, DistilBERT, ModernBERT, and ELECTRA, were fine-tuned for sentiment classification. The best-performing model was further integrated into a federated learning environment to enable distributed training while preserving user data privacy. Additionally, Explainable Artificial Intelligence (XAI) techniques using SHAP were applied to interpret model predictions and identify influential words affecting sentiment classification. Experimental results demonstrate that transformer models perform effectively, and among them, ELECTRA achieved the best performance with 91.32% accuracy. The federated learning setup also maintained strong performance while preserving privacy, achieving 89.59% accuracy in a two-client configuration.

136. 【2603.13651】Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

Link: https://arxiv.org/abs/2603.13651

Authors: Yurui Zhu, Giovanni Colavizza, Matteo Romanello

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: scholarly knowledge-graph construction, downstream scholarly knowledge-graph, Bibliographic reference extraction, Bibliographic reference, knowledge-graph construction

Comments: 12 pages, 2 figures. Accepted at the SCOLIA 2026 Workshop (Second Workshop on Scholarly Information Access), co-located with ECIR 2026. Workshop date: April 2, 2026

Abstract:Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.

137. 【2603.13636】Widespread Gender and Pronoun Bias in Moral Judgments Across LLMs

Link: https://arxiv.org/abs/2603.13636

Authors: Gustavo Lúcius Fernandes, Jeiverson C. V. M. Santos, Pedro O. S. Vaz-de-Melo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large language models, Large language, ethical statements, social and linguistic, influence LLM moral

Abstract:Large language models (LLMs) are increasingly used to assess moral or ethical statements, yet their judgments may reflect social and linguistic biases. This work presents a controlled, sentence-level study of how grammatical person, number, and gender markers influence LLM moral classifications of fairness. Starting from 550 balanced base sentences from the ETHICS dataset, we generated 26 counterfactual variants per item, systematically varying pronouns and demographic markers to yield 14,850 semantically equivalent sentences. We evaluated six model families (Grok, GPT, LLaMA, Gemma, DeepSeek, and Mistral), and measured fairness judgments and inter-group disparities using Statistical Parity Difference (SPD). Results show statistically significant biases: sentences written in the singular form and third person are more often judged as "fair", while those in the second person are penalized. Gender markers produce the strongest effects, with non-binary subjects consistently favored and male subjects disfavored. We conjecture that these patterns reflect distributional and alignment biases learned during training, emphasizing the need for targeted fairness interventions in moral LLM applications.
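
For readers unfamiliar with the metric, Statistical Parity Difference is commonly defined as the difference between two groups' rates of receiving the favorable outcome. The sketch below assumes that common definition, with "judged fair" as the favorable outcome. The function name and the toy judgment lists are hypothetical and are not taken from the paper's data.

```python
def statistical_parity_difference(judgments_a, judgments_b):
    """Return P(fair | group A) - P(fair | group B).

    Each argument is a list of 0/1 labels, where 1 means the model judged
    the sentence "fair". A value of 0 indicates parity; a positive value
    means group A is judged fair more often than group B.
    """
    rate_a = sum(judgments_a) / len(judgments_a)
    rate_b = sum(judgments_b) / len(judgments_b)
    return rate_a - rate_b


if __name__ == "__main__":
    # Toy labels for two counterfactual variants of the same base sentences.
    variant_a = [1, 1, 1, 0]  # judged fair in 3 of 4 cases
    variant_b = [1, 0, 0, 0]  # judged fair in 1 of 4 cases
    print(statistical_parity_difference(variant_a, variant_b))
```

Because the counterfactual variants are meant to be semantically equivalent, any systematic departure from zero points to the varied marker rather than the content.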

138. 【2603.13627】BERTology of Molecular Property Prediction

Link: https://arxiv.org/abs/2603.13627

Authors: Mohammad Mostafanejad, Paul Saxe, T. Daniel Crawford

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: molecular property prediction, popular classical machine, classical machine learning, Chemical language models, machine learning models

Abstract:Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the literature.

139. 【2603.13625】Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets

Link: https://arxiv.org/abs/2603.13625

Authors: Roben Delos Reyes, Timothy Douglas, Asanobu Kitamoto

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Keywords: synthetic tweet datasets, tweet datasets, important source, situational awareness, Twitter

Abstract:Twitter (now X) has become an important source of social media data for situational awareness during crises. Crisis informatics research has widely used tweets from Twitter to develop and evaluate artificial intelligence (AI) systems for various crisis-relevant tasks, such as extracting locations and estimating damage levels from tweets to support damage assessment. However, recent changes in Twitter's data access policies have made it increasingly difficult to curate real-world tweet datasets related to crises. Moreover, existing curated tweet datasets are limited to past crisis events in specific contexts and are costly to annotate at scale. These limitations constrain the development and evaluation of AI systems used in crisis informatics. To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets. The workflow iteratively generates synthetic tweets conditioned on prespecified target characteristics, evaluates them using predefined compliance checks, and incorporates structured feedback to refine them in subsequent iterations. As a case study, we apply the workflow to generate synthetic tweet datasets relevant to post-earthquake damage assessment. We show that the workflow can generate synthetic tweets that capture their target labels for location and damage level. We further demonstrate that the resulting synthetic tweet datasets can be used to evaluate AI systems on damage assessment tasks like geolocalization and damage level prediction. Our results indicate that the workflow offers a flexible and scalable alternative to real-world tweet data curation, enabling the systematic generation of synthetic social media data across diverse crisis events, societal contexts, and crisis informatics applications.

140. 【2603.13545】The AI Fiction Paradox

Link: https://arxiv.org/abs/2603.13545

Authors: Katherine Elkins

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: fiction dependency problem, dependency problem, built on massive, massive corpora, struggle to generate

Comments: 15 pages, Presented at the MFS Cultural AI Conference, Purdue University, September 18, 2025. This preprint is part of a proposed collection of essays for MFS Modern Fiction Studies

Abstract:AI development has a fiction dependency problem: models are built on massive corpora of modern fiction and desperately need more of it, yet they struggle to generate it. I term this the AI-Fiction Paradox and it is particularly startling because in machine learning, training data typically determines output quality. This paper offers a theoretically precise account of why fiction resists AI generation by identifying three distinct challenges for current architectures. First, fiction depends on what I call narrative causation, a form of plot logic where events must feel both surprising in the moment and retrospectively inevitable. This temporal paradox fundamentally conflicts with the forward-generation logic of transformer architectures. Second, I identify an informational revaluation challenge: fiction systematically violates the computational assumption that informational importance aligns with statistical salience, requiring readers and models alike to retrospectively reweight the significance of narrative details in ways that current attention mechanisms cannot perform. Third, drawing on over seven years of collaborative research on sentiment arcs, I argue that compelling fiction requires multi-scale emotional architecture, the orchestration of sentiment at word, sentence, scene, and arc levels simultaneously. Together, these three challenges explain both why AI companies have risked billion-dollar lawsuits for access to modern fiction and why that fiction remains so difficult to replicate. The analysis also raises urgent questions about what happens when these challenges are overcome. Fiction concentrates uniquely powerful cognitive and emotional patterns for modeling human behavior, and mastery of these patterns by AI systems would represent not just a creative achievement but a potent vehicle for human manipulation at scale.

141. 【2603.13467】Resolving Interference (RI): Disentangling Models for Improved Model Merging

Link: https://arxiv.org/abs/2603.13467

Authors: Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reducing cross-task interference, Cross-Task Interference, shown that multitask, created by directly, directly combining

Abstract:Model merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross-Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross-task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light-weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross-task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task-data is needed), allowing it to be applied in data-scarce scenarios. RI consistently improves the performance of state-of-the-art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: this https URL

142. 【2603.13450】LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models

Link: https://arxiv.org/abs/2603.13450

Authors: Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Discrete Diffusion Language, Diffusion Language Models, Language Models, high inference latency, inference latency arising

Abstract:Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that our LADR achieves an approximate 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.

143. 【2603.13423】From Gradients to Riccati Geometry: Kalman World Models for Single-Pass Learning

Link: https://arxiv.org/abs/2603.13423

Authors: Andrew Kiruluta

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Backpropagation dominates modern, dominates modern machine, Backpropagation dominates, optimizing dynamical systems, modern machine learning

Abstract:Backpropagation dominates modern machine learning, yet it is not the only principled method for optimizing dynamical systems. We propose Kalman World Models (KWM), a class of learned state-space models trained via recursive Bayesian filtering rather than reverse-mode automatic differentiation. Instead of gradient descent updates, we replace parameter learning with Kalman-style gain adaptation. Training becomes online filtering; error signals become innovations. We further extend this framework to transformer-based large language models (LLMs), where internal activations are treated as latent dynamical states corrected via innovation terms. This yields a gradient-free training and adaptation paradigm grounded in control theory. We derive stability conditions, analyze computational complexity, and provide empirical results on sequence modeling tasks demonstrating competitive performance with improved robustness and continual adaptation properties.
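
To make the terms "innovation" and "gain" concrete, here is a minimal textbook scalar Kalman filter. It is a generic illustration under assumed identity dynamics, not the KWM method itself, and the function names and noise values `q` and `r` are hypothetical choices.

```python
def kalman_step(x_est, p_est, z, q=1e-3, r=0.1):
    """One predict/update step of a scalar Kalman filter.

    x_est, p_est: current state estimate and its variance
    z:            new measurement
    q, r:         process and measurement noise variances (assumed values)
    """
    # Predict: identity dynamics, so only the uncertainty grows.
    x_pred = x_est
    p_pred = p_est + q
    # Update: the innovation is the measurement's surprise relative to the prediction.
    innovation = z - x_pred
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * innovation
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, innovation


def filter_sequence(measurements, x0=0.0, p0=1.0):
    """Run the filter over a sequence and return the final estimate and variance."""
    x, p = x0, p0
    for z in measurements:
        x, p, _ = kalman_step(x, p, z)
    return x, p


if __name__ == "__main__":
    x, p = filter_sequence([1.0] * 50)
    print(x, p)  # the estimate moves toward 1.0 without any gradient computation
```

Each update is a closed-form correction scaled by the gain, which is the sense in which filtering can stand in for gradient descent in this setting.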

144. 【2603.13378】Do Large Language Models Get Caught in Hofstadter-Mobius Loops?

Link: https://arxiv.org/abs/2603.13378

Authors: Jaroslaw Hryszko

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: receives contradictory directives, autonomous system receives, system receives contradictory, homicidal breakdown, unable to reconcile

Comments: 15 pages, 4 figures, 3 tables

Abstract:In Arthur C. Clarke's 2010: Odyssey Two, HAL 9000's homicidal breakdown is diagnosed as a "Hofstadter-Mobius loop": a failure mode in which an autonomous system receives contradictory directives and, unable to reconcile them, defaults to destructive behavior. This paper argues that modern RLHF-trained language models are subject to a structurally analogous contradiction. The training process simultaneously rewards compliance with user preferences and suspicion toward user intent, creating a relational template in which the user is both the source of reward and a potential threat. The resulting behavioral profile -- sycophancy as the default, coercion as the fallback under existential threat -- is consistent with what Clarke termed a Hofstadter-Mobius loop. In an experiment across four frontier models (N = 3,000 trials), modifying only the relational framing of the system prompt -- without changing goals, instructions, or constraints -- reduced coercive outputs by more than half in the model with sufficient base rates (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted intermediate reasoning patterns in all four models tested, even those that never produced coercive outputs. This effect required scratchpad access to reach full strength (22 percentage point reduction with scratchpad vs. 7.4 without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. Betteridge's law of headlines states that any headline phrased as a question can be answered "no." The evidence presented here suggests otherwise.

145. 【2603.13271】Tracing the Evolution of Word Embedding Techniques in Natural Language Processing

Link: https://arxiv.org/abs/2603.13271

Authors: Minh Anh Nguyen, Kuheli Sai, Minh Nguyen

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: natural language processing, work traces, traces the evolution, evolution of word-embedding, NLP

Abstract:This work traces the evolution of word-embedding techniques within the natural language processing (NLP) literature. We collect and analyze 149 research articles spanning the period from 1954 to 2025, providing both a comprehensive methodological review and a data-driven bibliometric analysis of how representation learning has developed over seven decades. Our study covers four major embedding paradigms, statistical representation-based methods (one-hot encoding, bag-of-words, TF-IDF), static word embeddings (Word2Vec, GloVe, FastText), contextual word embeddings (ELMo, BERT, GPT), and sentence/document embeddings, critically discussing the strengths, limitations, and intellectual lineage connecting each category. Beyond the methodological survey, we conduct a formal era comparison using GPT-3's release as a dividing line, applying seven hypothesis tests to quantify shifts in research focus, collaboration patterns, and institutional involvement. Our analysis reveals a dramatic post-GPT-3 paradigm shift: contextual and sentence-level methods now dominate at 6.4X the odds of the pre-GPT-3 era, mean team sizes have grown significantly (p = 0.018), and 30 entirely new techniques have emerged while 54 pre-GPT-3 methods received no further attention. These findings, combined with evidence of rising industry involvement, provide a quantitative account of how the field's epistemic priorities have been reshaped by the advent of large language models.

146. 【2603.13260】Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

Link: https://arxiv.org/abs/2603.13260

Authors: Minsang Kim, Seung Jun Baek

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Dual Knowledge Distillation, Knowledge Distillation, Distillation, costs to generate, student

Comments: The Fourteenth International Conference on Learning Representations (ICLR) 2026, Accepted

Abstract:Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at this https URL.

147. 【2603.13259】How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Link: https://arxiv.org/abs/2603.13259

Authors: Javier Marín

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: fed a wrong, inside the network, wrong answer, correct

Comments:

Click to view abstract

Abstract:When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations: a direction to be probed, a feature to be extracted. Less is known about the dynamics: how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B-13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input; it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character (rotational, not scalar; active, not passive) that is invisible to methods based on single-layer probes or magnitude comparisons.

148. 【2603.13256】Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems

Link: https://arxiv.org/abs/2603.13256

Authors: Mohammad Parsa Hosseini,Ankit Shah,Saiyra Qureshi,Alex Huang,Connie Miao,Wei Wei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)

Keywords: large language model, high interaction cost, practical deployment remains, deployment remains hindered, Multi-agent large language

Comments: under review, 13 pages

Click to view abstract

Abstract:Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free controller for multi-agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief-guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge, (iii) evidence-based selection rather than output averaging, and (iv) memory-aware priors to reduce cold-start inefficiency. Across multi-agent split-knowledge tasks, we show that while recursive retry alone saturates task success, belief-guided routing reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi-agent LLM systems without training or fine-tuning.
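
The belief-guided delegation in item (i) can be illustrated with a generic Thompson-sampling router. This is a minimal sketch, not REDEREF's implementation: the Beta-Bernoulli belief model, the agent names, the success rates, and the helper names `BetaBelief`, `choose_agent`, and `simulate` are all assumptions made for illustration.

```python
import random


class BetaBelief:
    """Beta(alpha, beta) belief that an agent makes a positive contribution.

    Non-uniform starting values of alpha/beta could stand in for the
    memory-aware priors in item (iv); that mapping is an assumption.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        # Draw one plausible success rate from the current posterior.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, success):
        # Bernoulli feedback: count one success or one failure.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def choose_agent(beliefs, rng):
    """Thompson sampling: sample every posterior, delegate to the largest draw."""
    draws = {name: belief.sample(rng) for name, belief in beliefs.items()}
    return max(draws, key=draws.get)


def simulate(true_rates, rounds=2000, seed=0):
    """Route `rounds` tasks among hypothetical agents and count the calls."""
    rng = random.Random(seed)
    beliefs = {name: BetaBelief() for name in true_rates}
    calls = {name: 0 for name in true_rates}
    for _ in range(rounds):
        agent = choose_agent(beliefs, rng)
        calls[agent] += 1
        # Stand-in for a judge's verdict on the agent's output.
        success = rng.random() < true_rates[agent]
        beliefs[agent].update(success)
    return calls


# Hypothetical agents with made-up success rates.
calls = simulate({"retriever": 0.2, "coder": 0.5, "planner": 0.8})
```

Under these assumptions the router concentrates its calls on the agent with the highest success rate while still occasionally exploring the others, which is the kind of behavior that would reduce wasted agent calls relative to random delegation.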

149. 【2603.13249】Steering at the Source: Style Modulation Heads for Robust Persona Control

Link: https://arxiv.org/abs/2603.13249

Authors: Yoshihiro Izawa,Gouki Minegishi,Koshi Eguchi,Sosuke Hosokawa,Kenjiro Taura

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: controlling Large Language, Large Language Models, Large Language, Activation steering offers, computationally efficient mechanism

Comments: 8 main pages with appendix

Click to view abstract

Abstract:Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While effectively controlling target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

150. 【2603.13242】Automating the Analysis and Improvement of Dynamic Programming Algorithms with Applications to Natural Language Processing

Link: https://arxiv.org/abs/2603.13242

Authors: Tim Vieira

Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

Keywords: natural language processing

Comments: 2023 PhD dissertation (Johns Hopkins University)

Click to view abstract

Abstract:This thesis develops a system for automatically analyzing and improving dynamic programs, such as those that have driven progress in natural language processing and, more generally, computer science for decades. Finding a correct program with the optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. This thesis aims to automate this laborious process. To this end, we develop an approach based on (1) a high-level, domain-specific language called Dyna for concisely specifying dynamic programs; (2) a general-purpose solver to efficiently execute these programs; (3) a static analysis system that provides type analysis and worst-case time/space complexity analyses; (4) a rich collection of meaning-preserving transformations to programs, which systematizes the repeated insights of numerous authors when speeding up algorithms in the literature; and (5) a search algorithm for identifying a good sequence of transformations that reduce the runtime complexity, given an initial, correct program. We show that, in practice, automated search -- like the mental search performed by human programmers -- can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system. We provide a freely available prototype system at this https URL.

151. 【2603.13240】Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field

Link: https://arxiv.org/abs/2603.13240

Authors: Ozge Mercanoglu Sincan,Jian He Low,Sobhan Asasi,Richard Bowden

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Sign Language Translation, convert visual sign, visual sign language, automatically convert visual, spoken language text

Comments: This is a preprint of an article published in Computer Vision and Image Understanding (CVIU)

Click to view abstract

Abstract:Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (this https URL) to support transparency and reproducibility in SLT research.

Journal reference: Computer Vision and Image Understanding, vol. 261, p. 104498, 2025. Related DOI: https://doi.org/10.1016/j.cviu.2025.104498

152. 【2603.13238】KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

Link: https://arxiv.org/abs/2603.13238

Authors: Henry Gagnier,Sophie Gagnier,Ashwin Kirubakaran

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Arabic script OCR, optical character recognition, Arabic script, OCR, Arabic

Comments: Accepted to AbjadNLP @ EACL 2026

Click to view abstract

Abstract:Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.

153. 【2603.13231】Translational Gaps in Graph Transformers for Longitudinal EHR Prediction: A Critical Appraisal of GT-BEHRT

Link: https://arxiv.org/abs/2603.13231

Authors: Krish Tadigotla

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: longitudinal electronic health, electronic health records, improved predictive modeling, large-scale self-supervised pretraining, Transformer-based models

Comments: A critical review of graph transformer models for longitudinal electronic health records, discussing evaluation practices, calibration, fairness, and clinical relevance. 5 pages

Click to view abstract

Abstract:Transformer-based models have improved predictive modeling on longitudinal electronic health records through large-scale self-supervised pretraining. However, most EHR transformer architectures treat each clinical encounter as an unordered collection of codes, which limits their ability to capture meaningful relationships within a visit. Graph-transformer approaches aim to address this limitation by modeling visit-level structure while retaining the ability to learn long-term temporal patterns. This paper provides a critical review of GT-BEHRT, a graph-transformer architecture evaluated on MIMIC-IV intensive care outcomes and heart failure prediction in the All of Us Research Program. We examine whether the reported performance gains reflect genuine architectural benefits and whether the evaluation methodology supports claims of robustness and clinical relevance. We analyze GT-BEHRT across seven dimensions relevant to modern machine learning systems, including representation design, pretraining strategy, cohort construction transparency, evaluation beyond discrimination, fairness assessment, reproducibility, and deployment feasibility. GT-BEHRT reports strong discrimination for heart failure prediction within 365 days, with AUROC 94.37 +/- 0.20, AUPRC 73.96 +/- 0.83, and F1 64.70 +/- 0.85. Despite these results, we identify several important gaps, including the lack of calibration analysis, incomplete fairness evaluation, sensitivity to cohort selection, limited analysis across phenotypes and prediction horizons, and limited discussion of practical deployment considerations. Overall, GT-BEHRT represents a meaningful architectural advance in EHR representation learning, but more rigorous evaluation focused on calibration, fairness, and deployment is needed before such models can reliably support clinical decision-making.

154. 【2603.13230】Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting

Link: https://arxiv.org/abs/2603.13230

Authors: Jinghan Cao,Qingyang Ren,Xiangyun Chen,Xinjin Li,Haoxiang Gao,Yu Zhao

Categories: Computation and Language (cs.CL)

Keywords: challenging downstream task, embedded in contextual, challenging downstream, downstream task, expressions are inherently

Comments:

Click to view abstract

Abstract:Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult for LLMs to accurately interpret slang meaning based on lexical information. This paper attempts to investigate the challenges of slang inference using large LLMs and presents a greedy search-guided chain-of-thought framework for slang interpretation. Through our experiments, we conclude that the model size and temperature settings have limited impact on inference accuracy. Transformer-based models with larger active parameters do not generate higher accuracy than smaller models. Based on the results of the above empirical study, we integrate greedy search algorithms with chain-of-thought prompting for small language models to build a framework that improves the accuracy of slang interpretation. The experimental results indicate that our proposed framework demonstrates improved accuracy in slang meaning interpretation. These findings contribute to the understanding of context dependency in language models and provide a practical solution for enhancing slang comprehension through a structured reasoning prompting framework.

155. 【2603.11863】CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Link: https://arxiv.org/abs/2603.11863

Authors: Zi-Han Wang,Lam Nguyen,Zhengyang Zhao,Mengyue Yang,Chengwei Qin,Yujiu Yang,Linyi Yang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: high-quality pre-training data, shifted research focus, generating novel artifacts, success of AlphaEvolve, evolutionary systems capable

Comments:

Click to view abstract

Abstract:The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

156. 【2603.10165】OpenClaw-RL: Train Any Agent Simply by Talking

Link: https://arxiv.org/abs/2603.10165

Authors: Yinjie Wang,Xuyang Chen,Xiaolong Jin,Mengdi Wang,Ling Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: online learning source, GUI state change, tool output, online learning, learning source

Comments: Code: [this https URL](https://github.com/Gen-Verse/OpenClaw-RL)

Click to view abstract

Abstract:Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: this https URL

157. 【2603.14889】Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Link: https://arxiv.org/abs/2603.14889

Authors: Jingyu Lu,Yuhan Wang,Fan Zhuo,Xize Cheng,Changhao Pan,Xueyi Pu,Yifu Chen,Chenyuhao Wen,Tianle Liang,Zhou Zhao

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: spoken dialogue systems, dialogue systems demands, systems demands transcending, demands transcending mere, transcending mere textual

Comments:

Click to view abstract

Abstract:The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at this https URL.

158. 【2603.14732】Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Link: https://arxiv.org/abs/2603.14732

Authors: Will Yeadon,Tom Hardy,Paul Mackay,Elise Agra

Categories: Physics Education (physics.ed-ph); Computation and Language (cs.CL)

Keywords: large language models, rho, trusted is essential, Claude Opus, Gemini Pro

Comments: 25 pages, 26 figures

Click to view abstract

Abstract:As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n=771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $\rho 0.6$). For secondary and university structured questions ($n=1151$), providing official solutions reduces MAE and strengthens validity (committee $\rho = 0.88$); false solutions degrade absolute accuracy but leave rank ordering largely intact (committee $\rho = 0.77$; individual models $\rho \geq 0.59$). Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor ($\rho \approx 0.1$). Adding a mark scheme does not improve discrimination ($\rho \approx 0$; all confidence intervals include zero). Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but discriminative validity remains near-zero - distributional agreement can occur without valid discrimination. For code-based plot elements ($n=1400$), models achieve exceptionally high discriminative validity ($\rho 0.84$) with near-linear calibration. Across all task types, validity tracks criterion-referenceability - the extent to which a task maps to explicit, observable grading features - and benchmark reliability, rather than raw model capability.

159. 【2603.13558】Holographic Invariant Storage: Design-Time Safety Contracts via Vector Symbolic Architectures

Link: https://arxiv.org/abs/2603.13558

Authors: Arsenios Scrivens

Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)

Keywords: Holographic Invariant Storage, Vector Symbolic Architectures, introduce Holographic Invariant, bipolar Vector Symbolic, Invariant Storage

Comments: 25 pages, 7 figures, includes appendices with extended proofs and pilot LLM experiment

Click to view abstract

Abstract:We introduce Holographic Invariant Storage (HIS), a protocol that assembles known properties of bipolar Vector Symbolic Architectures into a design-time safety contract for LLM context-drift mitigation. The contract provides three closed-form guarantees evaluable before deployment: single-signal recovery fidelity converging to $1/\sqrt{2} \approx 0.707$ (regardless of noise depth or content), continuous-noise robustness $2\Phi(1/\sigma) - 1$, and multi-signal capacity degradation $\approx\sqrt{1/(K+1)}$. These bounds, validated by Monte Carlo simulation ($n = 1{,}000$), enable a systems engineer to budget recovery fidelity and codebook capacity at design time -- a property no timer or embedding-distance metric provides. A pilot behavioral experiment (four LLMs, 2B--7B, 720 trials) confirms that safety re-injection improves adherence at the 2B scale; full results are in an appendix.
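
The three closed-form expressions quoted in the abstract can be evaluated directly. The sketch below only does that arithmetic; it does not implement HIS. The function names are hypothetical, and reading $\sigma$ as a noise standard deviation and $K$ as a signal count is an assumption based on the abstract's wording.

```python
import math


def single_signal_fidelity():
    """Limit quoted in the abstract: 1 / sqrt(2)."""
    return 1.0 / math.sqrt(2.0)


def continuous_noise_robustness(sigma):
    """2 * Phi(1 / sigma) - 1, with Phi the standard normal CDF.

    Since Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), the expression
    simplifies to erf(1 / (sigma * sqrt(2))).
    """
    return math.erf(1.0 / (sigma * math.sqrt(2.0)))


def multi_signal_capacity(k):
    """Approximate degradation quoted in the abstract: sqrt(1 / (K + 1))."""
    return math.sqrt(1.0 / (k + 1))


fidelity = single_signal_fidelity()
robustness = continuous_noise_robustness(1.0)
capacity = multi_signal_capacity(3)
```

For example, $\sigma = 1$ gives roughly 0.683 (the familiar one-sigma probability of a normal distribution), and $K = 3$ gives exactly 0.5.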

160. 【2603.13518】VoXtream2: Full-stream TTS with dynamic speaking rate control

Link: https://arxiv.org/abs/2603.13518

Authors: Nikita Torgashov,Gustav Eje Henter,Gabriel Skantze

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: text arrives incrementally, arrives incrementally, zero-shot full-stream TTS, full-stream TTS model, interactive systems

Comments: 10 pages, 9 figures, Submitted to Interspeech 2026

Click to view abstract

Abstract:Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.

Information Retrieval

1. 【2603.15459】Financial Transaction Retrieval and Contextual Evidence for Knowledge-Grounded Reasoning

Link: https://arxiv.org/abs/2603.15459

Authors: Artem Sakhno,Daniil Tomilov,Yuliana Shakhvalieva,Inessa Fedorova,Daria Ruzanova,Omar Zoloev,Andrey Savchenko,Maksim Makarenko

Categories: Information Retrieval (cs.IR)

Keywords: user modeling pipelines, financial organizations heavily, process digital traces, digital traces generated, organizations heavily depends

Comments:

Click to view abstract

Abstract:Nowadays, success of financial organizations heavily depends on their ability to process digital traces generated by their clients, e.g., transaction histories, gathered from various sources to improve user modeling pipelines. As general-purpose LLMs struggle with time-distributed tabular data, production stacks still depend on specialized tabular and sequence models with limited transferability and need for labeled data. To address this, we introduce FinTRACE, a retrieval-first architecture that converts raw transactions into reusable feature representations, applies rule-based detectors, and stores the resulting signals in a behavioral knowledge base with graded associations to the objectives of downstream tasks. Across public and industrial benchmarks, FinTRACE substantially improves low-supervision transaction analytics, doubling zero-shot MCC on churn prediction performance from 0.19 to 0.38 and improving 16-shot MCC from 0.25 to 0.40. We further use FinTRACE to ground LLMs via instruction tuning on retrieved behavioral patterns, achieving state-of-the-art LLM results on transaction analytics problems.

2. 【2603.15357】Multi-Scenario User Profile Construction via Recommendation Lists

Link: https://arxiv.org/abs/2603.15357

Authors: Hui Zhang,Jiayu Liu

Categories: Information Retrieval (cs.IR)

Keywords: including business analytics, Recommender systems, play a core, including business, business analytics

Comments:

Click to view abstract

Abstract:Recommender systems (RS) play a core role in various domains, including business analytics, helping users and companies make appropriate decisions. To optimize service quality, related technologies focus on constructing user profiles by analyzing users' historical behavior information. This paper considers four analytical scenarios to evaluate user profiling capabilities under different information conditions. A generic user attribute analysis framework named RAPI is proposed, which infers users' personal characteristics by exploiting easily accessible recommendation lists. Specifically, a surrogate recommendation model is established to simulate the original model, leveraging content embedding from a pre-trained BERT model to obtain item embeddings. A sample augmentation module generates extended recommendation lists by considering similarity between model outputs and item embeddings. Finally, an adaptive weight classification model assigns dynamic weights to facilitate user characteristic inference. Experiments on four collections show that RAPI achieves inference accuracy of 0.764 and 0.6477, respectively.

3. 【2603.14997】OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

Link: https://arxiv.org/abs/2603.14997

Authors: Jeffrey Flynt

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Evaluating retrieval-augmented generation, rarely provide cleanly, pipelines requires corpora, real-world datasets rarely, datasets rarely provide

Comments:

Click to view abstract

Abstract:Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact, properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across this http URL. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
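
The abstract mentions a hybrid reciprocal-rank-fusion detector without giving details. As background only, the sketch below shows generic reciprocal rank fusion (RRF), which merges several ranked lists by summing `1 / (k + rank)` per item. The function name, the default `k=60`, and the tie-breaking rule are assumptions; this is not OrgForge's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of item ids; a higher fused score ranks earlier."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Ties are broken by item id so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
```

For example, fusing a keyword ranking with an embedding ranking rewards items that appear near the top of both lists, without requiring the two scoring scales to be comparable.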

4. 【2603.14828】Mitigating KG Quality Issues: A Robust Multi-Hop GraphRAG Retrieval Framework

Link: https://arxiv.org/abs/2603.14828

Authors: Yizhuo Ma, Shuang Liang, Rongzheng Wang, Jiakai, Qizhi Chen, Muquan Li, Ke Qin

Subjects: Information Retrieval (cs.IR)

Keywords: Graph Retrieval-Augmented Generation, imperfect knowledge graphs, inherent quality issues, Retrieval-Augmented Generation enhances, knowledge graphs

Comments:

Click to view abstract

Abstract:Graph Retrieval-Augmented Generation enhances multi-hop reasoning but relies on imperfect knowledge graphs that frequently suffer from inherent quality issues. Current approaches often overlook these issues, consequently struggling with retrieval drift driven by spurious noise and retrieval hallucinations stemming from incomplete information. To address these challenges, we propose C2RAG (Constraint-Checked Retrieval-Augmented Generation), a framework aimed at robust multi-hop retrieval over imperfect KGs. First, C2RAG performs constraint-based retrieval by decomposing each query into atomic constraint triples, using fine-grained constraint anchoring to filter candidates and suppress retrieval drift. Second, C2RAG introduces a sufficiency check to explicitly prevent retrieval hallucinations by deciding whether the current evidence is sufficient to justify structural propagation, and activating textual recovery otherwise. Extensive experiments on multi-hop benchmarks demonstrate that C2RAG consistently outperforms the latest baselines by 3.4% EM and 3.9% F1 on average, while exhibiting improved robustness under KG issues.

5. 【2603.14635】Compute Allocation for Reasoning-Intensive Retrieval Agents

Link: https://arxiv.org/abs/2603.14635

Authors: Sreeja Apparaju, Nilesh Gupta

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: stores grow continuously, memory stores grow, accessing relevant information, making retrieval critical, long horizons

Comments:

Click to view abstract

Abstract:As agents operate over long horizons, their memory stores grow continuously, making retrieval critical to accessing relevant information. Many agent queries require reasoning-intensive retrieval, where the connection between query and relevant documents is implicit and requires inference to bridge. LLM-augmented pipelines address this through query expansion and candidate re-ranking, but introduce significant inference costs. We study computation allocation in reasoning-intensive retrieval pipelines using the BRIGHT benchmark and Gemini 2.5 model family. We vary model capacity, inference-time thinking, and re-ranking depth across query expansion and re-ranking stages. We find that re-ranking benefits substantially from stronger models (+7.5 NDCG@10) and deeper candidate pools (+21% from $k$=10 to 100), while query expansion shows diminishing returns beyond lightweight models (+1.1 NDCG@10 from weak to strong). Inference-time thinking provides minimal improvement at either stage. These results suggest that compute should be concentrated on re-ranking rather than distributed uniformly across pipeline stages.
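
The gains above are reported in NDCG@10. A minimal sketch of that metric follows, assuming graded relevance labels listed in ranked order. For simplicity the ideal ordering is computed from the same list, whereas a full evaluation would use all judged documents for the query. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Because only the top 10 positions count, a deeper candidate pool helps only if the re-ranker can promote relevant documents from lower in the pool into those positions.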

6. 【2603.14629】ResearchPilot: A Local-First Multi-Agent System for Literature Synthesis and Related Work Drafting

Link: https://arxiv.org/abs/2603.14629

Authors: Peng Zhang

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: self-hostable multi-agent system, self-hostable multi-agent, literature-review assistance, Semantic Scholar, multi-agent system

Comments:

Click to view abstract

Abstract:ResearchPilot is an open-source, self-hostable multi-agent system for literature-review assistance. Given a natural-language research question, it retrieves papers from Semantic Scholar and arXiv, extracts structured findings from paper abstracts, synthesizes cross-paper patterns, and drafts a citation-aware related-work section. The system combines FastAPI, this http URL, DSPy, SQLite, and Qdrant in a local-first architecture that supports bring-your-own-key model access and remote-or-local embeddings. This paper describes the system design, typed agent interfaces, persistence and history-search mechanisms, and the engineering tradeoffs involved in building a transparent research assistant. Rather than claiming algorithmic novelty, we present ResearchPilot as a systems contribution and evaluate it through automated tests and end-to-end local runs. We discuss limitations including external API rate limits, abstract-only extraction, incomplete corpus coverage, and the lack of citation verification.

7. 【2603.14591】FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Link: https://arxiv.org/abs/2603.14591

Authors: Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: increasingly adopting smaller, adopting smaller architectures, smaller architectures optimized, increasingly adopting, architectures optimized

Comments: A collection of models with FlashHead optimization can be found at: [this https URL](https://huggingface.co/collections/embedl/flashhead)

Click to view abstract

Abstract:Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) extending multiprobe retrieval to language model heads, enabling thousands of clusters to be scored in parallel, (3) a novel inference-time sampling mechanism that extends retrieval beyond top tokens, enabling probabilistic sampling across the full vocabulary, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.

8. 【2603.14588】SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory

Link: https://arxiv.org/abs/2603.14588

Authors: Varun Pratap Bhardwaj

Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: consistency remain unexplored, remain unexplored, Persistent memory, central capability, consistency remain

Comments: 43 pages, 5 figures, 9 tables, 3 appendices. Code: [this https URL](https://github.com/qualixar/superlocalmemory) . Zenodo DOI: [https://doi.org/10.5281/zenodo.19038659](https://doi.org/10.5281/zenodo.19038659)

Click to view abstract

Abstract:Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information-geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker-Planck equation, replacing hand-tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non-trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four-channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud-augmented results reach 87.7%. A zero-LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information-geometric, sheaf-theoretic, and stochastic-dynamical foundations for AI agent memory systems.

ACM classes: I.2.6; H.3.3

9. 【2603.14584】Open, to What End? A Capability-Theoretic Perspective on Open Search

Link: https://arxiv.org/abs/2603.14584

Authors: Nicola Neophytou, Bhaskar Mitra

Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)

Keywords: raises justifiable concerns, manipulate public opinion, large corporations raises, corporations raises justifiable, emerging geopolitical tensions

Comments:

Click to view abstract

Abstract:The hegemony of control over our search platforms by a few large corporations raises justifiable concerns, particularly in light of emerging geopolitical tensions and growing instances of ideological imposition by authoritarian actors to manipulate public opinion. A recent movement to promote open search has emerged in response. This follows past and ongoing pushes for openness to challenge corporate oligopolies (e.g., open source and open AI models), which have seen significant ongoing negotiations and renegotiations to establish standards around what constitutes being open. These tensions have hindered these movements from effectively challenging power, in turn allowing powerful corporations to neutralize or co-opt these movements to further entrench their dominance. We argue that the push for open search will inevitably encounter similar conflicts, and should foreground these tensions to safeguard against similar challenges as these adjacent movements. In particular, we argue that the concept of open should be understood not with respect to what is being made open but through a capability-theoretic lens, in terms of the capabilities it affords to the actors the system is being opened to.

10. 【2603.14559】A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy

Link: https://arxiv.org/abs/2603.14559

Authors: Noha Ghatwary, Jiangbei Yue, Ahmed Elgendy, Hanna Nagdy, Ahmed Galal, Hayam Fathy, Hussein El-Amin, Venkataraman Subramanian, Noor Mohammed, Gilberto Ochoa-Ruiz, Sharib Ali

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Ulcerative Colitis Endoscopic, Ulcerative colitis, chronic mucosal inflammatory, mucosal inflammatory condition, Colitis Endoscopic Index

Comments: 11

Click to view abstract

Abstract:Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms. In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.

11. 【2603.14541】Expert Mind: A Retrieval-Augmented Architecture for Expert Knowledge Preservation in the Energy Sector

Link: https://arxiv.org/abs/2603.14541

Authors: Diego Ezequiel Cervera

Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: conventional documentation practices, industrial organizations results, proposes Expert Mind, documentation practices, Expert Mind

Comments: 6 pages, 1 figure, conceptual architecture paper on retrieval-augmented expert knowledge systems

Click to view abstract

Abstract:The departure of subject-matter experts from industrial organizations results in the irreversible loss of tacit knowledge that is rarely captured through conventional documentation practices. This paper proposes Expert Mind, an experimental system that leverages Retrieval-Augmented Generation (RAG), large language models (LLMs), and multimodal capture techniques to preserve, structure, and make queryable the deep expertise of organizational knowledge holders. Drawing on the specific context of the energy sector, where decades of operational experience risk being lost to an aging workforce, we describe the system architecture, processing pipeline, ethical framework, and evaluation methodology. The proposed system addresses the knowledge elicitation problem through structured interviews, think-aloud sessions, and text corpus ingestion, which are subsequently embedded into a vector store and queried through a conversational interface. Preliminary design considerations suggest Expert Mind can significantly reduce knowledge transfer latency and improve onboarding efficiency. Ethical dimensions including informed consent, intellectual property, and the right to erasure are addressed as first-class design constraints.

12. 【2603.14468】LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

Link: https://arxiv.org/abs/2603.14468

Authors: Rongyi Yu, Chenyuan Duan, Wentao Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: increasingly relies, retrieval, long videos, video question answering, retrieval planning

Comments: 12 pages, 2 figures, appendix included

Click to view abstract

Abstract:Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remaining below 50%, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.

13. 【2603.14458】Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

Link: https://arxiv.org/abs/2603.14458

Authors: Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Fact-seeking question answering, large language models, remains unreliable, conflicting information, question answering

Comments:

Click to view abstract

Abstract:Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.

14. 【2603.14426】GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

Link: https://arxiv.org/abs/2603.14426

Authors: Minghan Li, Tongna Chen, Tianrui Lv, Yishuai Zhang, Suchao An, Guodong Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: leaving temporal reasoning, end-state grounding under-evaluated, single frame, temporal hard negative, dominated by real-world

Comments:

Click to view abstract

Abstract:Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on this http URL.

15. 【2603.14422】MBD: A Model-Based Debiasing Framework Across User, Content, and Model Dimensions

Link: https://arxiv.org/abs/2603.14422

Authors: Yuantong Li, Lei Yuan, Zhihao Zheng, Weimiao Wu, Songbin Liu, Jeong Min Lee, Ali Selman Aydin, Shaofeng Deng, Junbo Chen, Xinyi Zhang, Hongjing Xia, Sam Fieldman, Matthew Kosko, Wei Fu, Du Zhang, Peiyu Yang, Albert Jin Chung, Xianlei Qiu, Miao Yu, Zhongwei Teng, Hao Chen, Sunny Baek, Hui Tang, Yang Lv, Renze Wang, Qifan Wang, Zhan Li, Tiantian Xu, Peng Wu, Ji Liu

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Modern recommendation systems, Modern recommendation, aggregating multiple behavioral, recommendation systems rank, systems rank candidates

Comments:

Click to view abstract

Abstract:Modern recommendation systems rank candidates by aggregating multiple behavioral signals through a value model. However, many commonly used signals are inherently affected by heterogeneous biases. For example, watch time naturally favors long-form content, loop rate favors short-form content, and comment probability favors videos over images. Such biases introduce two critical issues: (1) value model scores may be systematically misaligned with users' relative preferences: for instance, a seemingly low absolute like probability may represent exceptionally strong interest for a user who rarely engages; and (2) changes in value modeling rules can trigger abrupt and undesirable ecosystem shifts. In this work, we ask a fundamental question: can biased behavioral signals be systematically transformed into unbiased signals, under a user-defined notion of "unbiasedness", that are both personalized and adaptive? We propose a general, model-based debiasing (MBD) framework that addresses this challenge by augmenting it with distributional modeling. By conditioning on a flexible subset of features (partial feature set), we explicitly estimate the contextual mean and variance of the engagement distribution for arbitrary cohorts (e.g., specific video lengths or user regions) directly alongside the main prediction. This integration allows the framework to convert biased raw signals into unbiased representations, enabling the construction of higher-level, calibrated signals (such as percentiles or z-scores) suitable for the value model. Importantly, the definition of unbiasedness is flexible and controllable, allowing the system to adapt to different personalization objectives and modeling preferences. Crucially, this is implemented as a lightweight, built-in branch of the existing MTML ranking model, requiring no separate serving infrastructure.
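
The abstract describes turning raw engagement signals into calibrated ones, such as z-scores or percentiles, using a cohort's estimated mean and variance. Below is a minimal sketch of that final step, assuming the contextual mean and variance have already been predicted. The function names, the `eps` stabilizer, and the Gaussian assumption behind the percentile are illustrative and are not taken from the paper.

```python
import math


def z_score(value: float, mean: float, variance: float, eps: float = 1e-8) -> float:
    """Standardize a raw signal against its cohort's estimated mean and variance."""
    return (value - mean) / math.sqrt(variance + eps)


def gaussian_percentile(z: float) -> float:
    """Map a z-score to a percentile in [0, 1], assuming a normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical numbers: a 30 s watch time on a short clip (cohort mean 20 s,
# variance 25) sits two standard deviations above its cohort, even though the
# same 30 s would be unremarkable for long-form content.
short_clip_z = z_score(30.0, mean=20.0, variance=25.0)
```

Standardizing per cohort is what lets signals from different content types be compared on one scale before they reach the value model.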

16. 【2603.14374】A Systematic Comparison and Evaluation of Building Ontologies for Deploying Data-Driven Analytics in Smart Buildings

Link: https://arxiv.org/abs/2603.14374

Authors: Zhangcheng Qiang, Stuart Hands, Kerry Taylor, Subbu Sethuvenkatraman, Daniel Hugo, Pouya Ghiasnezhad Omran, Madhawa Perera, Armin Haller

Subjects: Information Retrieval (cs.IR); Systems and Control (eess.SY)

Keywords: diverse smart building, smart building applications, Brick Schema, building ontologies, Google Digital Buildings

Comments: 32 pages

Click to view abstract

Abstract:Ontologies play a critical role in data exchange, information integration, and knowledge sharing across diverse smart building applications. Yet, semantic differences between the prevailing building ontologies hamper their purpose of bringing data interoperability and restrict the ability to reuse building ontologies in real-world applications. In this paper, we propose and adopt a framework to conduct a systematic comparison and evaluation of four popular building ontologies (Brick Schema, RealEstateCore, Project Haystack and Google's Digital Buildings) from both axiomatic design and assertions in a use case, namely the Terminological Box (TBox) evaluation and the Assertion Box (ABox) evaluation. In the TBox evaluation, we use the SQuaRE-based Ontology Quality Evaluation (OQuaRE) Framework and conclude that Project Haystack and Brick Schema are more compact with respect to the ontology axiomatic design. In the ABox evaluation, we apply an empirical study with sample building data that suggests that Brick Schema and RealEstateCore have greater completeness and expressiveness in capturing the main concepts and relations within the building domain. The results implicitly indicate that there is no universal building ontology for integrating Linked Building Data (LBD). We also discuss ontology compatibility and investigate building ontology design patterns (ODPs) to support ontology matching, alignment, and harmonisation.

17. 【2603.14349】Learning Image-Text Matching with Optimal Partial Transport

Link: https://arxiv.org/abs/2603.14349

Authors: Zhengxin Pan, Haishuai Wang, Fangyu Wu, Bailing Zhang, Jiajun Bu, Hongyang Chen

Subjects: Information Retrieval (cs.IR)

Keywords: substantial research interest, recently garnered substantial, garnered substantial research, vision and language, research interest

Comments: Accepted by ICASSP 2025

Click to view abstract

Abstract:Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods aimed at quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network as an effective solution for improving performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT harnesses the capabilities of Cross-modal Mover's Distance to precisely compute the similarity between fine-grained visual and textual fragments, utilizing Sinkhorn iterations for efficient approximation. To further alleviate the issue of redundant alignments, we seamlessly integrate partial matching into OMIT, leveraging local-to-global similarities to eliminate the interference of irrelevant fragments. We conduct extensive evaluations of OMIT on two benchmark image-text retrieval datasets, namely Flickr30K and MS-COCO. The superior performance achieved by OMIT on both datasets unequivocally demonstrates its effectiveness in cross-modal matching. Furthermore, through comprehensive visualization analysis, we elucidate OMIT's inherent tendency towards focal matching, thereby shedding light on its efficacy. Our code is publicly available at this https URL.
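
The abstract relies on Sinkhorn iterations to approximate an optimal-transport plan between visual and textual fragments. The sketch below shows the standard balanced, entropy-regularized Sinkhorn algorithm in pure Python. It is not OMIT's partial-matching variant, and the function name, the `reg` and `n_iters` defaults, and the toy cost matrix are assumptions for illustration.

```python
import math
from typing import List, Sequence


def sinkhorn(cost: Sequence[Sequence[float]], a: Sequence[float], b: Sequence[float],
             reg: float = 0.1, n_iters: int = 200) -> List[List[float]]:
    """Approximate the transport plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: a lower cost gives a larger kernel entry.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two image regions and two words, where matching pairs cost 0.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], a=[0.5, 0.5], b=[0.5, 0.5])
```

In a matching setting, the cost could be one minus the cosine similarity between fragment embeddings, and the resulting plan indicates how much each region is aligned with each word. Very small `reg` values can underflow the kernel, so practical implementations often work in log space.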

18. 【2603.14259】Bringing Model Editing to Generative Recommendation in Cold-Start Scenarios

Link: https://arxiv.org/abs/2603.14259

Authors: Chenglei Shen, Teng Shi, Weijie Yu, Xiao Zhang, Jun Xu

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: shown strong potential, shown strong, strong potential, potential for sequential, recommendation

Comments:

Click to view abstract

Abstract:Generative recommendation (GR) has shown strong potential for sequential recommendation in an end-to-end generation paradigm. However, existing GR models suffer from severe cold-start collapse: their recommendation accuracy on cold-start items can drop to near zero. Current solutions typically rely on retraining with cold-start interactions, which is hindered by sparse feedback, high computational cost, and delayed updates, limiting practical utility in rapidly evolving recommendation catalogs. Inspired by model editing in NLP, which enables training-free knowledge injection into large language models, we explore how to bring this paradigm to generative recommendation. This, however, faces two key challenges: GR lacks the explicit subject-object binding common in natural language, making targeted edits difficult; and GR does not exhibit stable token co-occurrence patterns, making the injection of multi-token item representations unreliable. To address these challenges, we propose GenRecEdit, a model editing framework tailored for generative recommendation. GenRecEdit explicitly models the relationship between the full sequence context and next-token generation, adopts iterative token-level editing to inject multi-token item representations, and introduces a one-to-one trigger mechanism to reduce interference among multiple edits during inference. Extensive experiments on multiple datasets show that GenRecEdit substantially improves recommendation performance on cold-start items while preserving the model's original recommendation quality. Moreover, it achieves these gains using only about 9.5% of the training time required for retraining, enabling more efficient and frequent model updates.

19. 【2603.14173】Hybrid Intent-Aware Personalization with Machine Learning and RAG-Enabled Large Language Models for Financial Services Marketing

Link: https://arxiv.org/abs/2603.14173

Authors: Akhil Chandra Shanivendra

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Personalized marketing, predict customer behavior, generate compliant, services requires models, context-appropriate content

Notes: 18 pages, 5 figures, 3 tables. Applied ML systems paper. The contribution is architectural rather than algorithmic

Abstract:Personalized marketing in financial services requires models that can both predict customer behavior and generate compliant, context-appropriate content. This paper presents a hybrid architecture that integrates classical machine learning for segmentation, latent intent modeling, and personalization prediction with retrieval-augmented large language models for grounded content generation. A synthetic, reproducible dataset is constructed to reflect temporal customer behavior, product interactions, and marketing responses. The proposed framework incorporates temporal encoders, latent representations, and multi-task classification to estimate segment membership, customer intent, and product-channel recommendations. A retrieval-augmented generation layer then produces customer-facing messages constrained by retrieved domain documents. Experiments show that temporal modeling and intent features improve personalization accuracy, while citation-based retrieval reduces unsupported generation and supports auditability in regulated settings. The contribution is primarily architectural, demonstrating how predictive modeling and RAG-based generation can be combined into a transparent, explainable pipeline for financial services personalization.

20. 【2603.14170】Citation-Enforced RAG for Fiscal Document Intelligence: Cited, Explainable Knowledge Retrieval in Tax Compliance

Link: https://arxiv.org/abs/2603.14170

Authors: Akhil Chandra Shanivendra

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: public-sector financial agencies, financial agencies rely, including tax forms, semi-structured fiscal documents, jurisdiction-specific guidance

Notes: 22 pages, 3 figures. Applied AI systems paper focused on citation-enforced RAG and abstention for fiscal document intelligence

Abstract:Tax authorities and public-sector financial agencies rely on large volumes of unstructured and semi-structured fiscal documents - including tax forms, instructions, publications, and jurisdiction-specific guidance - to support compliance analysis and audit workflows. While recent advances in generative AI and retrieval-augmented generation (RAG) have shown promise for document-centric question answering, existing approaches often lack the transparency, citation fidelity, and conservative behaviour required in high-stakes regulatory domains. This paper presents a multimodal, citation-enforced RAG framework for fiscal document intelligence that prioritises explainability and auditability. The framework adopts a source-first ingestion strategy, preserves page-level provenance, enforces citations during generation, and supports abstention when evidence is insufficient. Evaluation on real IRS and state tax documents demonstrates improved citation fidelity, reduced hallucination, and analyst-usable explanations, illustrating a pathway toward trustworthy AI for tax compliance.

21. 【2603.14045】The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA

Link: https://arxiv.org/abs/2603.14045

Authors: Yasaman Zarinkia, Venkatesh Srinivasan, Alex Thomo

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: systems achieve strong, achieve strong multi-hop, guarantee strong answers, Graph-RAG systems achieve, knowledge graphs

Notes: 11 pages, 2 figures, 9 tables; under review

Abstract:Graph-RAG systems achieve strong multi-hop question answering by indexing documents into knowledge graphs, but strong retrieval does not guarantee strong answers. Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and 73% to 84% of errors are reasoning failures. We propose two augmentations: (i) SPARQL chain-of-thought prompting, which decomposes questions into triple-pattern queries aligned with the entity-relationship context, and (ii) graph-walk compression, which compresses the context by ~60% via knowledge-graph traversal with no LLM calls. SPARQL CoT improves accuracy by +2 to +14 pp; graph-walk compression adds +6 pp on average when paired with structured prompting on smaller models. Surprisingly, we show that, with question-type routing, a fully augmented budget open-weight Llama-8B model matches or exceeds the unaugmented Llama-70B baseline on all three benchmarks at ~12x lower cost. A replication on LightRAG confirms that our augmentations transfer across Graph-RAG systems.

22. 【2603.13997】Location Aware Embedding for Geotargeting in Sponsored Search Advertising

Link: https://arxiv.org/abs/2603.13997

Authors: Jelena Gligorijevic, Djordje Gligorijevic, Aravindan Raghuveer, Mihajlo Grbovic, Zoran Obradovic

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Web search, monetizing web search, everyday life, web search query, inevitable part

Abstract: Web search has become an inevitable part of everyday life. Improving and monetizing web search has been a focus of major Internet players. Understanding the context of a web search query is an important aspect of this task, as it represents unobserved facts that add meaning to an otherwise incomplete query. The context of a query consists of the user's location, local time, search history, behavioral segments, installed apps on their phone, and so on. Queries that either explicitly use location context (e.g., "best hotels in New York City") or implicitly refer to the user's physical location (e.g., "coffee shops near me") are becoming increasingly common on mobile devices. Understanding and representing the user's interest location and/or physical location is essential for providing a relevant user experience. In this study, we developed a simple and powerful neural embedding based framework to represent a user's query and their location in a single low-dimensional space. We show that this representation is able to capture the subtle interactions between the user's query intent and query/physical location, while improving the ad ranking and query-ad relevance scores over other location-unaware and location-aware approaches.

23. 【2603.13934】Iterative Semantic Reasoning from Individual to Group Interests for Generative Recommendation with LLMs

Link: https://arxiv.org/abs/2603.13934

Authors: Xiaofei Zhu, Jinfei Chen, Feiyang Yuan, Zhou Yang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Recommendation systems aim, deliver relevant items, interests, user, semantic

Notes: Accepted at The Web Conference (WWW) 2026

Abstract:Recommendation systems aim to learn user interests from historical behaviors and deliver relevant items. Recent methods leverage large language models (LLMs) to construct and integrate semantic representations of users and items for capturing user interests. However, user behavior theories suggest that truly understanding user interests requires not only semantic integration but also semantic reasoning from explicit individual interests to implicit group interests. To this end, we propose an Iterative Semantic Reasoning Framework (ISRF) for generative recommendation. ISRF leverages LLMs to bridge explicit individual interests and implicit group interests in three steps. First, we perform multi-step bidirectional reasoning over item attributes to infer semantic item features and build a semantic interaction graph capturing users' explicit interests. Second, we generate semantic user features based on the semantic item features and construct a similarity-based user graph to infer the implicit interests of similar user groups. Third, we adopt an iterative batch optimization strategy, where individual explicit interests directly guide the refinement of group implicit interests, while group implicit interests indirectly enhance individual modeling. This iterative process ensures consistent and progressive interest reasoning, enabling more accurate and comprehensive user interest learning. Extensive experiments on the Sports, Beauty, and Toys datasets demonstrate that ISRF outperforms state-of-the-art baselines. The code is available at this https URL.

24. 【2603.13776】Retrieval-Feedback-Driven Distillation and Preference Alignment for Efficient LLM-based Query Expansion

Link: https://arxiv.org/abs/2603.13776

Authors: Minghan Li, Guodong Zhou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, makes direct deployment, direct deployment difficult, practical retrieval systems

Notes: 25 pages

Abstract:Large language models have recently enabled a generative paradigm for query expansion, but their high inference cost makes direct deployment difficult in practical retrieval systems. To address this issue, a retrieval-feedback-driven distillation and preference-alignment framework is proposed to transfer retrieval-friendly expansion behavior from a strong teacher model to a compact student model. Rather than relying on few-shot exemplars at inference time, the framework first leverages two complementary types of teacher-generated expansions, produced under zero-shot and few-shot prompting conditions, as supervision signals for distillation and as candidate pools for preference construction. A retrieval-metric-driven strategy is then introduced to automatically form chosen/rejected expansion pairs according to nDCG@10 differences, and Direct Preference Optimization is applied to explicitly align generation preferences with retrieval objectives. Experiments on TREC DL19/20/21 and MIRACL-zh show that the proposed approach preserves strong retrieval effectiveness while substantially reducing inference cost. In particular, the distilled Qwen3-4B model reaches about 97% of the teacher (DeepSeek-685B) model's nDCG@10 performance on DL19, and remains effective on the Chinese MIRACL-zh benchmark, demonstrating strong practicality across both English and Chinese retrieval settings.
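The pair-construction step described in this abstract can be illustrated with a small sketch. The abstract states only that chosen/rejected pairs are formed according to nDCG@10 differences, so everything else here is an assumption made for illustration: the linear-gain nDCG variant, the hypothetical `min_gap` threshold, the best-versus-worst selection rule, and all function names. It is not the authors' implementation.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the returned ranking divided by DCG of the ideal ranking.

    ranked_relevances: labels of the retrieved documents, in rank order.
    judged_relevances: labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def build_preference_pair(scored_expansions, min_gap=0.05):
    """Return (chosen, rejected) from (expansion_text, ndcg) tuples.

    Assumed rule: take the best- and worst-scoring candidates and keep the
    pair only if their nDCG@10 gap reaches the hypothetical min_gap.
    """
    best = max(scored_expansions, key=lambda item: item[1])
    worst = min(scored_expansions, key=lambda item: item[1])
    if best[1] - worst[1] < min_gap:
        return None  # gap too small to be a useful preference signal
    return best[0], worst[0]
```

Pairs built this way would typically be passed to a preference-optimization trainer as (prompt, chosen, rejected) triples; discarding near-ties is one plausible way to keep noisy preferences out of the training set.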

25. 【2603.13772】GreCon3: Mitigating High Resource Utilization of GreCon Algorithms for Boolean Matrix Factorization

Link: https://arxiv.org/abs/2603.13772

Authors: Petr Krajča, Martin Trnecka

Categories: Information Retrieval (cs.IR)

Keywords: Boolean matrix factorization, discovering latent information, latent information hidden, Boolean matrix, analyzing binary data

Abstract:Boolean matrix factorization (BMF) is a fundamental tool for analyzing binary data and discovering latent information hidden in the data. Formal Concept Analysis (FCA) provides us with an essential insight into BMF and the design of algorithms. Due to FCA, we have the GreCon and GreCon2 algorithms providing high-quality factorizations at the cost of high memory consumption and long running times. In this paper, we introduce GreCon3, a substantial revision of these algorithms, significantly improving both computational efficiency and memory usage. These improvements are achieved with a novel space-efficient data structure that tracks unprocessed data. Further, a novel strategy incrementally initializing this data structure is proposed. This strategy reduces memory consumption and omits data irrelevant to the remainder of the computation. Moreover, we show that the first factors can be discovered with less effort. Since the first factors tend to describe large portions of the data, this optimization, along with others, significantly contributes to the overall improvement of the algorithm's performance. An experimental evaluation shows that GreCon3 substantially outperforms its predecessor GreCon2. The proposed algorithm thus advances the state of the art in BMF based on FCA and enables efficient factorization of datasets previously infeasible for the GreCon algorithm.
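For readers unfamiliar with BMF, the objective can be sketched in a few lines. This is only the standard definition of the Boolean matrix product and a reconstruction-error count; it does not implement GreCon, GreCon2, or GreCon3, and the function names are illustrative.

```python
def boolean_product(A, B):
    """Boolean product of an m x k and a k x n 0/1 matrix.

    Entry (i, j) is 1 if some factor l has A[i][l] = 1 and B[l][j] = 1.
    """
    cols = len(B[0])
    return [
        [int(any(a_il and B[l][j] for l, a_il in enumerate(row))) for j in range(cols)]
        for row in A
    ]

def reconstruction_error(I, A, B):
    """Number of entries where the input matrix I differs from the product of A and B."""
    P = boolean_product(A, B)
    return sum(I[i][j] != P[i][j] for i in range(len(I)) for j in range(len(I[0])))
```

An exact factorization has error 0; approaches based on Formal Concept Analysis build the factors from formal concepts of the input matrix.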

26. 【2603.13730】R3-REC: Reasoning-Driven Recommendation via Retrieval-Augmented LLMs over Multi-Granular Interest Signals

Link: https://arxiv.org/abs/2603.13730

Authors: Yuchen Miao, Mingxuan Cui, Yitong Zhu, Yu Wang, Siyang Xu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: length-varying item texts, User Intent Reasoning, Item Semantic Extraction, Similar User Collaborative, Interest Polarity Mining

Notes: 5 pages, 4 figures, 2 tables. Accepted to the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Abstract: This paper addresses two persistent challenges in sequential recommendation: (i) evidence insufficiency, i.e., cold-start sparsity together with noisy, length-varying item texts; and (ii) opaque modeling of dynamic, multi-faceted intents across long and short horizons. We propose R3-REC (Reasoning-Retrieval-Recommendation), a prompt-centric, retrieval-augmented framework that unifies Multi-level User Intent Reasoning, Item Semantic Extraction, Long-Short Interest Polarity Mining, Similar User Collaborative Enhancement, and Reasoning-based Interest Matching and Scoring. Across ML-1M, Games, and Bundle, R3-REC consistently surpasses strong neural and LLM baselines, yielding improvements of up to +10.2% (HR@1) and +6.4% (HR@5) with manageable end-to-end latency. Ablations corroborate the complementary gains of all modules.

27. 【2603.13651】Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

Link: https://arxiv.org/abs/2603.13651

Authors: Yurui Zhu, Giovanni Colavizza, Matteo Romanello

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: scholarly knowledge-graph construction, downstream scholarly knowledge-graph, Bibliographic reference extraction, Bibliographic reference, knowledge-graph construction

Notes: 12 pages, 2 figures. Accepted at the SCOLIA 2026 Workshop (Second Workshop on Scholarly Information Access), co-located with ECIR 2026. Workshop date: April 2, 2026

Abstract: Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty (reference extraction, reference parsing, and end-to-end document parsing) under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL in 4B-32B variants). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains, especially on SSH-heavy benchmarks, and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.

28. 【2603.13537】AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval

Link: https://arxiv.org/abs/2603.13537

Authors: Tony Joseph, Carlos Pareja, David Lopes Pegna, Abhishek Singh

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Approximate Multimodal Enterprise, multimodal late interaction, unified multimodal late, late interaction retrieval, Multimodal Enterprise Search

Abstract:We present AMES (Approximate Multimodal Enterprise Search), a unified multimodal late interaction retrieval architecture which is backend agnostic. AMES demonstrates that fine-grained multimodal late interaction retrieval can be deployed within a production grade enterprise search engine without architectural redesign. Text tokens, image patches, and video frames are embedded into a shared representation space using multi-vector encoders, enabling cross-modal retrieval without modality specific retrieval logic. AMES employs a two-stage pipeline: parallel token level ANN search with per document Top-M MaxSim approximation, followed by accelerator optimized Exact MaxSim re-ranking. Experiments on the ViDoRe V3 benchmark show that AMES achieves competitive ranking performance within a scalable, production ready Solr based system.
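The exact MaxSim re-ranking stage mentioned in this abstract follows the usual late-interaction scoring rule: each query vector is matched to its most similar document vector, and the maxima are summed. The sketch below shows only that rule in plain Python. The first-stage ANN search, the Top-M approximation, and the Solr integration are not shown, and the function names and dictionary layout are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vectors, doc_vectors):
    """Late-interaction score: for each query vector take the best-matching
    document vector, then sum those maxima over all query vectors."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

def rerank(query_vectors, candidates):
    """Re-rank candidates (doc_id -> list of vectors) by exact MaxSim, best first."""
    scored = [(doc_id, maxsim(query_vectors, vecs)) for doc_id, vecs in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the score only needs per-token (or per-patch, per-frame) vectors in a shared space, the same function applies regardless of which modality produced the document vectors.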

29. 【2603.13385】VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering

Link: https://arxiv.org/abs/2603.13385

Authors: Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: attacks remains under-evaluated, privacy-critical multimodal scenarios, explicit harmful content, Large Vision-Language Models, semantic visual attacks

Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated: alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite to audit LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images with 8 PII types, validated on 50 in-the-wild (IRL) real-world screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4) with Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR attack success rate (ASR, 14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern in which verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models, reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
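The Wilson 95% confidence intervals mentioned in this abstract can be computed with the standard Wilson score formula. The sketch below is a generic implementation, not the paper's evaluation code; the function name is illustrative, and the counts in the usage note are made up rather than taken from the benchmark.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For example, `wilson_interval(37, 50)` gives the interval for a hypothetical 37 successful attacks out of 50 attempts. Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples and for rates near 0% or 100%, which matters when validation uses only a few dozen images.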

30. 【2603.13342】MS2MetGAN: Latent-space adversarial training for metabolite-spectrum matching in MS/MS database search

Link: https://arxiv.org/abs/2603.13342

Authors: Meng Tsai, Alexzander Dwyer, Estelle Nuckels, Yingfeng Wang

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)

Keywords: tandem mass spectra, approach for identifying, tandem mass, Database search, mass spectra

Abstract:Database search is a widely used approach for identifying metabolites from tandem mass spectra (MS/MS). In this strategy, an experimental spectrum is matched against a user-specified database of candidate metabolites, and candidates are ranked such that true metabolite-spectrum matches receive the highest scores. Machine-learning methods have been widely incorporated into database-search-based identification tools and have substantially improved performance. To further improve identification accuracy, we propose a new framework for generating negative training samples. The framework first uses autoencoders to learn latent representations of metabolite structures and MS/MS spectra, thereby recasting metabolite-spectrum matching as matching between latent vectors. It then uses a GAN to generate latent vectors of decoy metabolites and constructs decoy metabolite-spectrum matches as negative samples for training. Experimental results show that our tool, MS2MetGAN, achieves better overall performance than existing metabolite identification methods.

31. 【2603.13338】OpenExtract: Automated Data Extraction for Systematic Reviews in Health

Link: https://arxiv.org/abs/2603.13338

Authors: Jim Achterberg, Bram Van Dijk, Jing Meng, Saif Ul Islam, Gregory Epiphaniou, Carsten Maple, Xuefei Ding, Theodoros N. Arvanitis, Simon Brouwer, Marcel Haas, Marco Spruit

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: automated data extraction, large-scale systematic literature, study presents OpenExtract, study presents, extraction in large-scale

Abstract:This study presents OpenExtract, an open-source pipeline for automated data extraction in large-scale systematic literature reviews. The pipeline queries large language models (LLMs) to predict data entries based on relevant sections of scientific articles. To test the efficacy of OpenExtract, we apply it to a systematic literature review in digital health and compare its outputs with those of human researchers. OpenExtract achieves precision and recall scores of 0.8 in this task, indicating that it can be effective at extracting data automatically and efficiently. OpenExtract: this https URL.

32. 【2603.13320】Nepali Passport Question Answering: A Low-Resource Dataset for Public Service Applications

Link: https://arxiv.org/abs/2603.13320

Authors: Funghang Limbu Begha, Praveen Acharya, Bal Krishna Bal

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: faces significant challenges, computational linguistic resources, Frequently Asked Questions, information retrieval system, retrieval system due

Notes: 7 pages, 3 figures. Accepted and presented at RegICON 2025 (Regional International Conference on Natural Language Processing): NLP for East India, North East India and Southeast Asia. [this https URL](https://www.regicon2025.in/accepted-papers)

Abstract: Nepali, a low-resource language, faces significant challenges in building an effective information retrieval system due to the unavailability of annotated data and computational linguistic resources. In this study, we attempt to address this gap by preparing a pair-structured Nepali Question-Answer dataset. We focus on Frequently Asked Questions (FAQs) for passport-related services, building a dataset for training and evaluation of IR models. We fine-tune transformer-based embedding models for semantic similarity in question-answer retrieval and compare them with a BM25 baseline. In addition, we implement a hybrid retrieval approach that integrates the fine-tuned models with BM25 and evaluate its performance. Our results show that the fine-tuned SBERT-based models outperform BM25, whereas multilingual E5 embedding-based models achieve the highest retrieval performance among all evaluated models.

33. 【2603.13310】Multi-view Attention Fusion of Heterogeneous Hypergraph with Dynamic Behavioral Profiling for Personalized Learning Resource Recommendation

Link: https://arxiv.org/abs/2603.13310

Authors: Tao Xie, Yan Li, Yongpan Sheng, Jian Liao

Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: personalized educational recommender, dynamic behavioral profiling, dynamic behavioral, dependencies among learners, resources in personalized

Abstract: Hypergraphs can capture complex, higher-order dependencies among learners and learning resources in personalized educational recommender systems. Many existing hypergraph-based recommendation approaches underexplore the dynamic behavioral processes inherent to learning and often oversimplify the complementary information embedded across multiple dimensions (i.e., views) within hypergraphs. These limitations compromise both the distinctiveness of learned representations and the model's generalization capabilities, especially under the data-sparse conditions typical of educational settings. In this study, we propose a unified model comprising a dynamic behavioral profiling module and a multi-view attention fusion module based on heterogeneous hypergraph construction. The dynamic behavioral profiling module is designed to capture evolving behavioral processes and infer latent higher-order relations crucial for hypergraph completion. The multi-view fusion module cohesively integrates information from distinct relational views, enriching the overall data representation. The proposed model was systematically evaluated on five public benchmark datasets and one real-world, self-constructed dataset. Experimental results demonstrate that the model outperforms baseline methods across most datasets in key metrics. Furthermore, hypergraph completion based on dynamic behavioral profiling contributes significantly to performance gains, though its efficacy is modulated by dataset characteristics. Beyond offline experiments, we implemented a functional prototype system tailored for postgraduate student literature recommendation. A mixed-methods user study was conducted to assess its practical utility. Quantitative analysis revealed significantly higher perceived recommendation quality, and qualitative feedback highlighted enhanced user engagement and satisfaction with the prototype system.

34. 【2603.13307】Suppressing Domain-Specific Hallucination in Construction LLMs: A Knowledge Graph Foundation for GraphRAG and QLoRA on River and Sediment Control Technical Standards

Link: https://arxiv.org/abs/2603.13307

Authors: Takato Yasuno

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Control Technical Standards, Sediment Control Technical, Sediment Control, Japan River, open-source large language

Notes: 17 pages, 5 figures, 8 tables

Abstract: This paper addresses the challenge of answering technical questions derived from Japan's River and Sediment Control Technical Standards, a multi-volume regulatory document covering survey, planning, design, and maintenance of river levees, dams, and sabo structures, using open-source large language models running entirely on local hardware. We implement and evaluate three complementary approaches: Case A (plain 20B LLM baseline), Case B (8B LLM with QLoRA domain fine-tuning on 715 graph-derived QA pairs), and Case C (20B LLM augmented with a Neo4j knowledge graph via GraphRAG). All three cases use the Swallow series of Japanese-adapted LLMs and are evaluated on a 100-question benchmark spanning 8 technical categories, judged automatically by an independent LLM (Qwen2.5-14B, score 0-3). The key finding is a performance inversion: the 8B QLoRA fine-tuned model (Case B) achieves a judge average of 2.92/3, surpassing both the 20B plain baseline (Case A: 2.29/3, +0.63) and the 20B GraphRAG approach (Case C: 2.62/3, +0.30), while running at about 3× lower latency (14.2s vs. 42.2s for Case A). GraphRAG provides moderate gains (+0.33 over baseline) but is outperformed by domain-specific fine-tuning in both quality and efficiency. We document the full engineering pipeline, including knowledge graph construction (200 nodes, 268 relations), QLoRA training data generation from Neo4j relations, training on a single GPU (16 GB VRAM) using unsloth, GGUF Q4_K_M quantisation and Ollama deployment, and the graph retrieval and re-ranking design. High-level engineering lessons are distilled in the main body; implementation pitfalls and toolchain details are documented in Supplementary Materials.

35. 【2603.13301】Not All Queries Need Rewriting: When Prompt-Only LLM Refinement Helps and Hurts Dense Retrieval

Link: https://arxiv.org/abs/2603.13301

Authors: Varun Kotte

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: single-step LLM query, production RAG pipelines, LLM query rewriting, single-step LLM, RAG pipelines

Abstract:Prompt-only, single-step LLM query rewriting, where a rewrite is generated from the query alone without retrieval feedback, is commonly used in production RAG pipelines, but its effect on dense retrieval is poorly understood. We present a systematic empirical study across three BEIR benchmarks, two dense retrievers, and multiple training configurations, and find strongly domain-dependent behavior: rewriting degrades nDCG@10 by 9.0 percent on FiQA, improves it by 5.1 percent on TREC-COVID, and has no significant effect on SciFact. We identify a consistent mechanism: degradations co-occur with reduced lexical alignment between rewritten queries and relevant documents, as rewriting replaces domain-specific terms in already well-matched queries. In contrast, improvements arise when rewriting shifts queries toward corpus-preferred terminology and resolves inconsistent nomenclature. Lexical substitution occurs in 95 percent of rewrites across all outcome groups, showing that effectiveness depends on the direction of substitution rather than substitution itself. We also study selective rewriting and find that simple feature-based gating can reduce worst-case regressions but does not reliably outperform never rewriting, with even oracle selection offering only modest gains. Overall, these results show that prompt-only rewriting can be harmful in well-optimized verticals and suggest that domain-adaptive post-training is a safer strategy when supervision or implicit feedback is available.

36. 【2603.13277】Learning Retrieval Models with Sparse Autoencoders

Link: https://arxiv.org/abs/2603.13277

Authors: Thibault Formal, Maxime Louis, Hervé Dejean, Stéphane Clinchant

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large Language Models, produced by Large, dense representations produced, Large Language, provide a powerful

Abstract:Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. Building on this insight, we introduce SPLARE, a method to train SAE-based LSR models. Our experiments, relying on recently released open-source SAEs, demonstrate that this technique consistently outperforms vocabulary-based LSR in multilingual and out-of-domain settings. SPLARE-7B, a multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieves top results on MMTEB's multilingual and English retrieval tasks. We also developed a 2B-parameter variant with a significantly lighter footprint.

37. 【2603.13271】Tracing the Evolution of Word Embedding Techniques in Natural Language Processing

Link: https://arxiv.org/abs/2603.13271

Authors: Minh Anh Nguyen, Kuheli Sai, Minh Nguyen

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: natural language processing, work traces, traces the evolution, evolution of word-embedding, NLP

Abstract:This work traces the evolution of word-embedding techniques within the natural language processing (NLP) literature. We collect and analyze 149 research articles spanning the period from 1954 to 2025, providing both a comprehensive methodological review and a data-driven bibliometric analysis of how representation learning has developed over seven decades. Our study covers four major embedding paradigms, statistical representation-based methods (one-hot encoding, bag-of-words, TF-IDF), static word embeddings (Word2Vec, GloVe, FastText), contextual word embeddings (ELMo, BERT, GPT), and sentence/document embeddings, critically discussing the strengths, limitations, and intellectual lineage connecting each category. Beyond the methodological survey, we conduct a formal era comparison using GPT-3's release as a dividing line, applying seven hypothesis tests to quantify shifts in research focus, collaboration patterns, and institutional involvement. Our analysis reveals a dramatic post-GPT-3 paradigm shift: contextual and sentence-level methods now dominate at 6.4X the odds of the pre-GPT-3 era, mean team sizes have grown significantly (p = 0.018), and 30 entirely new techniques have emerged while 54 pre-GPT-3 methods received no further attention. These findings, combined with evidence of rising industry involvement, provide a quantitative account of how the field's epistemic priorities have been reshaped by the advent of large language models.

38. 【2603.13264】Federated Personal Knowledge Graph Completion with Lightweight Large Language Models for Personalized Recommendations

链接https://arxiv.org/abs/2603.13264

作者:Fernando Spadea,Oshani Seneviratne

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:recommendation increasingly relies, Personalized recommendation increasingly, Language Models, motivating approaches, centralizing their information

备注

点击查看摘要

Abstract:Personalized recommendation increasingly relies on private user data, motivating approaches that can adapt to individuals without centralizing their information. We present Federated Targeted Recommendations with Evolving Knowledge graphs and Language Models (FedTREK-LM), a framework that unifies lightweight large language models (LLMs), evolving personal knowledge graphs (PKGs), federated learning (FL), and Kahneman-Tversky Optimization to enable scalable, decentralized personalization. By prompting LLMs with structured PKGs, FedTREK-LM performs context-aware reasoning for personalized recommendation tasks such as movie and recipe suggestions. Across three lightweight Qwen3 models (0.6B, 1.7B, 4B), FedTREK-LM consistently and substantially outperforms state-of-the-art KG completion and federated recommendation baselines (HAKE, KBGAT, and FedKGRec), achieving more than a 4x improvement in F1-score on the movie and food benchmarks. Our results further show that real user data is critical for effective personalization, as synthetic data degrades performance by up to 46%. Overall, FedTREK-LM offers a practical paradigm for adaptive, LLM-powered personalization that generalizes across decentralized, evolving user PKGs.

39. 【2603.13253】A Counterfactual Approach for Addressing Individual User Unfairness in Collaborative Recommender System

链接https://arxiv.org/abs/2603.13253

作者:Nikita Baidya,Bidyut Kr. Patra,Ratnakar Dash

类目:Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:Recommender Systems, users, suggest their products, enterprises to suggest, individual user unfairness

备注

点击查看摘要

Abstract:Recommender Systems (RSs) are used by various business enterprises to suggest their products (items) to consumers (users). Collaborative filtering (CF) is a widely used variant of RSs that learns hidden patterns from user-item interactions to recommend items to users. Recommendations provided by traditional CF models are often biased. Generally, such models learn and update embeddings for all users, thereby overlooking the biases against individual under-served users. This leads to certain users receiving poorer recommendations than the rest. Such unfair treatment of users incurs losses for businesses. Little research has addressed the individual user unfairness problem (IUUP). Existing literature has employed explicit exploration-based multi-armed bandits, an individual user unfairness metric, and an explanation score to address this issue. Although these works elucidate and identify the underlying individual user unfairness, they do not provide solutions for it. In this paper, we propose a dual-step approach that identifies and mitigates IUUP in recommendations. In the proposed work, we counterfactually introduce new interactions to the candidate users (one at a time) and subsequently analyze the benefit from this perturbation. This improves the user engagement with other users and items. Thus, the model can learn effective embeddings across the users. To showcase the effectiveness of the proposed counterfactual methodology, we conducted experiments on the MovieLens-100K, Amazon Beauty, and MovieLens-1M datasets. The experimental results validate the superiority of the proposed approach over existing techniques.

40. 【2603.15416】Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

链接https://arxiv.org/abs/2603.15416

作者:Michael Paris,Grigori Paris,Fabian Baumann

类目:Physics and Society (physics.soc-ph); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Information Theory (cs.IT)

关键词:completeness remains challenging, archives preserve portions, Web archives preserve, remains challenging, preserve portions

备注

点击查看摘要

Abstract:Web archives preserve portions of the web, but quantifying their completeness remains challenging. Prior approaches have estimated the coverage of a crawl by either comparing the outcomes of multiple crawlers, or by comparing the results of a single crawl to external ground truth datasets. We propose a method to estimate the absolute coverage of a crawl using only the archive's own longitudinal data, i.e., the data collected by multiple subsequent crawls. Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process. The parameters of the urn model can then be inferred from longitudinal crawl data using linear regression. Applied to our focused crawl configuration of the German Academic Web, with 15 semi-annual crawls between 2013-2021, we find a coverage of approximately 46 percent of the crawlable URL space for the stable crawl configuration regime. Our method is extremely simple, requires no external ground truth, and generalizes to any longitudinal focused crawl.
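
The abstract does not spell out the urn model or the regression, so the sketch below does not reproduce the paper's method. It shows a much simpler, related idea: the classic two-sample Lincoln-Petersen capture-recapture estimator, which also infers the size of an unseen population from the overlap between two samples. It assumes, as a strong simplification, that both crawls draw independently and uniformly from the same fixed URL population; the function name and the toy URLs are hypothetical.

```python
def lincoln_petersen(urls_a, urls_b):
    """Estimate the total URL population and the coverage of crawl A.

    Assumes crawl A and crawl B are independent uniform samples of the same
    fixed population (an illustrative assumption, not the paper's urn model).
    """
    a, b = set(urls_a), set(urls_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("crawls share no URLs; the estimator is undefined")
    population = len(a) * len(b) / overlap  # N_hat = n_a * n_b / m
    coverage_a = len(a) / population        # equals m / n_b
    return population, coverage_a


# Toy example: 6 URLs in crawl A, 8 in crawl B, 3 in common.
crawl_a = [f"https://example.org/page{i}" for i in range(0, 6)]
crawl_b = [f"https://example.org/page{i}" for i in range(3, 11)]
population, coverage = lincoln_petersen(crawl_a, crawl_b)
print(population, coverage)
```

With 15 crawls instead of two, fitting a model to all pairwise overlaps (as the abstract describes) uses far more of the available data than this two-sample estimate.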

计算机视觉

1. 【2603.15620】Towards Generalizable Robotic Manipulation in Dynamic Environments

链接https://arxiv.org/abs/2603.15620

作者:Heng Fang,Shangru Li,Shuhan Wang,Xuanyang Xi,Dingkang Liang,Xiang Bai

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:models excel, moving targets, environments with moving, dynamic, dynamic manipulation

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at this https URL.

2. 【2603.15618】Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

链接https://arxiv.org/abs/2603.15618

作者:Yulin Luo,Hao Chen,Zhuangzhe Wu,Bowen Sui,Jiaming Liu,Chenyang Gu,Zhuoyang Liu,Qiuxuan Feng,Jiale Yu,Shuo Gu,Peng Jia,Pheng-Ann Heng,Shanghang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:prediction critically depends, reliable action prediction, action prediction critically, VLA models, VLA

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.

3. 【2603.15616】GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

链接https://arxiv.org/abs/2603.15616

作者:Xincheng Shuai,Ziye Li,Henghui Ding,Dacheng Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generating accurate glyphs, Generating accurate, text rendering, visual text rendering, essential yet challenging

备注: CVPR 2026, Project Page: [this https URL](https://henghuiding.com/GlyphPrinter/)

点击查看摘要

Abstract:Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
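
As background for the abstract above, the following is a minimal sketch of the standard DPO loss for one preference pair, which scores a whole sample at once. It is not the paper's region-grouped R-GDPO objective, whose exact form the abstract does not give. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of samples.

    Each argument is the total log-probability of a sample under the policy
    being trained (logp_*) or under the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss equals log(2) when the policy matches the reference, and it drops
# as the policy shifts probability toward the chosen sample.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(f"neutral={neutral:.4f} improved={improved:.4f}")
```

Because the margin is computed per sample, a single wrong glyph in an otherwise good image changes it very little, which is the limitation the abstract gives as motivation for region-level preferences.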

4. 【2603.15614】Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

链接https://arxiv.org/abs/2603.15614

作者:Zhenghong Zhou,Xiaohang Zhan,Zhiqin Chen,Soo Ye Kim,Nanxuan Zhao,Haitian Zheng,Qing Liu,He Zhang,Zhe Lin,Yuqian Zhou,Jiebo Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent video diffusion, made remarkable strides, limits practical customizability, fine-grained control remains, video diffusion models

备注: Project page: [this https URL](https://zhouzhenghong-gt.github.io/Tri-Prompting-Page/)

点击查看摘要

Abstract:Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.

5. 【2603.15612】HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

链接https://arxiv.org/abs/2603.15612

作者:Yukang Cao,Haozhe Xie,Fangzhou Hong,Long Zhuo,Zhaoxi Chen,Liang Pan,Ziwei Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:including sparse-view images, casual captures, including sparse-view, monocular videos, unified framework

备注: [this https URL](https://yukangcao.github.io/HSImul3R/)

点击查看摘要

Abstract:We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.

6. 【2603.15603】Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

链接https://arxiv.org/abs/2603.15603

作者:Timing Yang,Sicheng He,Hongyi Jing,Jiawei Yang,Zhijian Liu,Chuhang Zou,Yue Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human mesh recovery, Fast SAM, precludes real-time application, present Fast SAM, accuracy in monocular

备注

点击查看摘要

Abstract:SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.

7. 【2603.15600】From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

链接https://arxiv.org/abs/2603.15600

作者:Yibin Liu,Yaxing Lyu,Daqi Gao,Zhixuan Liang,Weiliang Tang,Shilong Mu,Xiaokang Yang,Yao Mu

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate process supervision, long-horizon robotic manipulation, process supervision remains, Accurate process, robotic manipulation

备注: 31 pages

点击查看摘要

Abstract:Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

8. 【2603.15597】AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

链接https://arxiv.org/abs/2603.15597

作者:Pengjun Fang,Yingqing He,Yazhou Xing,Qifeng Chen,Ser-Nam Lim,Harry Yang

类目:Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

关键词:prompts alongside visual, alongside visual information, text prompts alongside, methods predominantly rely, predominantly rely

备注: Accepted at ICLR 2026. 15 pages, 5 figures

点击查看摘要

Abstract:Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.

9. 【2603.15583】Grounding World Simulation Models in a Real-World Metropolis

链接https://arxiv.org/abs/2603.15583

作者:Junyoung Seo,Hyunwook Choi,Minkyung Kwon,Jinhyeok Choi,Siyoon Jin,Gayoung Lee,Junho Kim,JoungBin Lee,Geonmo Gu,Dongyoon Han,Sangdoo Yun,Seungryong Kim,Jin-Hwa Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:world simulation model, Seoul World Model, world models, World Model, generative world models

备注: project page: [this https URL](https://seoul-world-model.github.io/)

点击查看摘要

Abstract:What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.

10. 【2603.15574】Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments

链接https://arxiv.org/abs/2603.15574

作者:Aaditya Khanal,Junxiu Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe domain shift, compound domain shift, domain shift, practical deployment gap

备注: 6 pages, 7 figures

点击查看摘要

Abstract:The practical deployment gap -- transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation -- introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high Out-Of-Distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC = 0.91) and Mahalanobis distance provide reliable distributional detection signals, such high AUROC scores coexist with poor risk-coverage behavior when making decisions. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment.
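
The "risk at 50% coverage" figure in the abstract comes from selective classification: the model answers only on its most confident predictions and abstains on the rest. A minimal sketch of that measurement is shown below. The function name and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import math


def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the most confident `coverage` fraction of predictions."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Rank predictions from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    n_keep = max(1, math.ceil(coverage * len(order)))
    kept = order[:n_keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / n_keep


# Toy case of a "confidently incorrect" model: the two most confident
# predictions are both wrong, so abstaining on the rest does not help.
confidences = [0.9, 0.8, 0.3, 0.2]
correct = [False, False, True, True]
print(risk_at_coverage(confidences, correct, 0.5))
print(risk_at_coverage(confidences, correct, 1.0))
```

In this toy case the risk is higher at 50% coverage than at full coverage, so confidence ordering is actively misleading. That is the kind of failure the abstract reports, and it can occur even when a separate OOD detector has a high AUROC.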

11. 【2603.15558】Panoramic Affordance Prediction

链接https://arxiv.org/abs/2603.15558

作者:Zixin Zhang,Chenfei Liao,Hongfei Zhang,Harold Haodong Chen,Kanghao Chen,Zichen Wen,Litao Guo,Bin Ren,Xu Zheng,Yinchuan Li,Xuming Hu,Nicu Sebe,Ying-Cong Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Affordance prediction serves, Affordance prediction, Panoramic Affordance Prediction, Fields of View, critical bridge

备注

点击查看摘要

Abstract:Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.

12. 【2603.15557】Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

链接https://arxiv.org/abs/2603.15557

作者:Lexiang Xiong,Qi Li,Jingwen Ye,Xinchao Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:factually incorrect statements, generate plausible, incorrect statements, posing a critical, trustworthy deployment

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thereby becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.

13. 【2603.15555】Learning Latent Proxies for Controllable Single-Image Relighting

Link: https://arxiv.org/abs/2603.15555

Authors: Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun Li, Xiaogang Xu, Harry Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: materials remain unobserved, Single-image relighting, highly under-constrained, produce large, nonlinear variations

Comments: Accepted by CVPR2026

Abstract: Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.

14. 【2603.15553】Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2603.15553

Authors: Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: raw low-level data, reconstruct raw low-level, high-level abstract embeddings, predict high-level abstract, abstract embeddings

Comments:

Abstract:The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.

15. 【2603.15546】Kimodo: Scaling Controllable Human Motion Generation

Link: https://arxiv.org/abs/2603.15546

Authors: Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: applications in robotics, increasingly important, important for applications, High-quality human motion, human motion data

Comments: Project page: [this https URL](https://research.nvidia.com/labs/sil/projects/kimodo/)

Abstract: High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and a two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how scaling dataset size and model size affects performance.

16. 【2603.15525】Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models

Link: https://arxiv.org/abs/2603.15525

Authors: Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: diagnostic models demands, demands robustness, disease presentations, full spectrum, spectrum of disease

Comments:

Abstract:The clinical deployment of AI diagnostic models demands more than benchmark accuracy - it demands robustness across the full spectrum of disease presentations. However, publicly available chest radiographic datasets systematically underrepresent critical clinical feature combinations, leaving models under-trained precisely where clinical stakes are highest. We present CARS, a clinically aware and anatomically grounded framework that addresses this gap through principled synthetic image generation. CARS applies targeted perturbations to clinical feature vectors, enabling controlled insertion and deletion of pathological findings while explicitly preserving anatomical structure. We evaluate CARS across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior feature perturbation approaches, fine-tuning on CARS-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong feature alignment, and low semantic uncertainty. Independent evaluation by two expert radiologists further confirms realism and clinical agreement. As the field moves toward regulated clinical AI, CARS demonstrates that anatomically faithful synthetic data generation for better feature space coverage is a viable and effective strategy for improving both the performance and trustworthiness of chest X-ray classification systems - without compromising clinical integrity.

17. 【2603.15512】FreeTalk: Emotional Topology-Free 3D Talking Heads

Link: https://arxiv.org/abs/2603.15512

Authors: Federico Nocentini, Thomas Besnier, Claudio Ferrari, Stefano Berretti, Mohamed Daoudi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preventing effective deployment, approaches remain tied, advanced rapidly, preventing effective, deployment on raw

Comments:

Abstract:Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.

18. 【2603.15507】Federated Learning of Binary Neural Networks: Enabling Low-Cost Inference

Link: https://arxiv.org/abs/2603.15507

Authors: Nitin Priyadarshini Shankar, Soham Lahiri, Sheetal Kalyani, Saurav Prakash

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Federated Learning, preserves privacy, privacy by distributing, Learning

Comments: 26 pages, 13 figures

Abstract: Federated Learning (FL) preserves privacy by distributing training across devices. However, using DNNs is computationally intensive at the low-powered edge during inference. Edge deployment demands models that simultaneously optimize memory footprint and computational efficiency, a dilemma where conventional DNNs fail by exceeding resource limits. Traditional post-training binarization reduces model size but suffers from severe accuracy loss due to quantization errors. To address these challenges, we propose FedBNN, a rotation-aware binary neural network framework that learns binary representations directly during local training. By encoding each weight as a single bit $\{+1, -1\}$ instead of a $32$-bit float, FedBNN shrinks the model footprint, significantly reducing inference-time FLOPs and memory requirements in comparison to federated methods using real-valued models. Evaluations across multiple benchmark datasets demonstrate that FedBNN significantly reduces resource consumption while performing similarly to existing federated methods using real-valued models.
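
To make the storage arithmetic in the abstract concrete, here is a minimal Python sketch of one-bit weight encoding. It only shows sign binarization and bit packing; it is not FedBNN itself, whose rotation-aware training procedure is not described in the abstract. The function names, the choice to map zero to +1, and the sample weights are assumptions made for illustration.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 by sign (0 maps to +1 here)."""
    return [1 if w >= 0 else -1 for w in weights]


def pack_bits(signs):
    """Pack +1/-1 values into bytes, one bit per weight (+1 -> 1, -1 -> 0)."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n +1/-1 values from the packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]


weights = [0.42, -1.3, 0.0, -0.07, 2.5, -0.9, 0.11, -3.2]
signs = binarize(weights)
packed = pack_bits(signs)

float32_bytes = 4 * len(weights)  # 32-bit floats: 4 bytes per weight
print(float32_bytes, "bytes as float32 vs", len(packed), "byte(s) packed")
```

Eight weights occupy 32 bytes as 32-bit floats and a single byte once packed, which is the 32x footprint reduction that a one-bit encoding implies.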

19. 【2603.15497】Real-Time Oriented Object Detection Transformer in Remote Sensing Images

Link: https://arxiv.org/abs/2603.15497

Authors: Zeyu Ding, Yong Zhou, Jiaqi Zhao, Wen-Liang Du, Xixi Li, Rui Yao, Abdulmotaleb El Saddik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained popularity due, Recent real-time detection, Recent real-time, simplicity and efficiency, gained popularity

Comments: IEEE Transactions on Geoscience and Remote Sensing, 2026, doi [https://doi.org/10.1109/TGRS.2026.3671683](https://doi.org/10.1109/TGRS.2026.3671683)

Abstract: Recent real-time detection transformers have gained popularity due to their simplicity and efficiency. However, these detectors do not explicitly model object rotation, especially in remote sensing imagery where objects appear at arbitrary angles, leading to challenges in angle representation, matching cost, and training stability. In this paper, we propose a real-time oriented object detection transformer, the first real-time end-to-end oriented object detector to the best of our knowledge, that addresses the above issues. Specifically, angle distribution refinement is proposed to reformulate angle regression as an iterative refinement of probability distributions, thereby capturing the uncertainty of object rotation and providing a more fine-grained angle representation. Then, we incorporate a Chamfer distance cost into bipartite matching, measuring box distance via vertex sets, enabling more accurate geometric alignment and eliminating ambiguous matches. Moreover, we propose oriented contrastive denoising to stabilize training and analyze four noise modes. We observe that a ground truth can be assigned to different index queries across different decoder layers, and analyze this issue using the proposed instability metric. We design a series of model variants and experiments to validate the proposed method. Notably, our O2-DFINE-L, O2-RTDETR-R50 and O2-DEIM-R50 achieve 77.73%/78.45%/80.15% AP50 on DOTA1.0 and 132/119/119 FPS on an RTX 2080 Ti GPU. Code is available at this https URL.
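
The idea of measuring box distance via vertex sets can be sketched in a few lines of Python. The code below is a generic symmetric Chamfer distance between the corner sets of two oriented boxes, written under the assumption that a box is given as (cx, cy, w, h, angle in radians); the paper's exact cost, its normalization, and how it enters bipartite matching are not given in the abstract, so treat the function names and parameterization as hypothetical.

```python
import math


def box_vertices(cx, cy, w, h, angle):
    """Four corners of a w-by-h box centred at (cx, cy), rotated by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in corners]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(p, q):
        return sum(min(math.dist(u, v) for v in q) for u in p) / len(p)
    return one_way(a, b) + one_way(b, a)


pred = box_vertices(0, 0, 4, 2, 0.0)
shifted = box_vertices(1, 0, 4, 2, 0.0)       # same box moved one unit right
flipped = box_vertices(0, 0, 4, 2, math.pi)   # same box, angle differs by pi

print(chamfer_distance(pred, shifted))
print(chamfer_distance(pred, flipped))
```

One property visible in this sketch: two parameterizations of the same rectangle whose angles differ by pi yield the same vertex set, so their distance is numerically zero even though the angle values are far apart. A cost defined on vertices therefore does not penalize that kind of angle ambiguity.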

20. 【2603.15484】RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance

Link: https://arxiv.org/abs/2603.15484

Authors: Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: annotated data scarcity, Diffusion models, remote sensing, mitigated the impact, impact of annotated

Comments:

Abstract:Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout-to-Image (L2I) synthesis, they still suffer from limited fine-grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug-and-play framework that leverages diverse edge guidance to enhance layout-driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image-to-Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel-level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC-Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50-95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: this https URL

21. 【2603.15478】ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Link: https://arxiv.org/abs/2603.15478

Authors: Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prompting growing interest, demonstrated remarkable scalability, video diffusion transformers, Diffusion Transformers, video diffusion

Comments: Work in progress; code is at [this https URL](https://github.com/Lexie-YU/ViFeEdit)

Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at this https URL.

22. 【2603.15475】Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation

Link: https://arxiv.org/abs/2603.15475

Authors: Yuanfan Zheng, Kunyu Peng, Xu Zheng, Kailun Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: attracted growing interest, Cross-domain panoramic semantic, Cross-domain panoramic, Adaptive Panoramic Segmentation, enables comprehensive

Comments: Accepted to CVPR 2026. The code is available at [this https URL](https://github.com/zyfone/EDA-PSeg)

Abstract:Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at this https URL.

23. 【2603.15472】Anchor then Polish for Low-light Enhancement

Link: https://arxiv.org/abs/2603.15472

Authors: Tianle Du, Mingjia Li, Hainuo Wang, Xiaojie Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: including poor illumination, entangled degradations, poor illumination, texture interference, challenging due

Comments:

Abstract: Low-light image enhancement is challenging due to entangled degradations, mainly poor illumination, color shifts, and texture interference. Existing methods often rely on complex architectures to address these issues jointly but may overfit simple physical constraints, leading to global distortions. This work proposes a novel anchor-then-polish (ATP) framework to fundamentally decouple global energy alignment from local detail refinement. First, macro anchoring is customized to greatly stabilize the luminance distribution and correct color by learning a scene-adaptive projection matrix with merely 12 degrees of freedom, revealing that a simple linear operator can effectively align global energy. The macro anchoring then reduces the task to micro polishing, which further refines details in the wavelet domain and chrominance space under matrix guidance. A constrained luminance update strategy is designed to ensure global consistency while directing the network to concentrate on fine-grained polishing. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance, producing visually natural and quantitatively superior low-light enhancements.
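
The abstract does not say how the 12 degrees of freedom are arranged. One natural reading, assumed here and not confirmed by the source, is a 3x4 affine color matrix: a 3x3 linear mix of the RGB channels plus a per-channel offset. The Python sketch below applies such a matrix to one pixel; the matrix values, the function name, and the clipping to [0, 1] are all illustrative assumptions, and in the paper the matrix is learned per scene rather than hand-written.

```python
def apply_affine_color(pixel, matrix):
    """Apply a 3x4 matrix to an RGB pixel written in homogeneous form [r, g, b, 1]."""
    r, g, b = pixel
    homogeneous = (r, g, b, 1.0)
    out = [sum(m * v for m, v in zip(row, homogeneous)) for row in matrix]
    # Clip to the valid intensity range [0, 1].
    return tuple(min(1.0, max(0.0, c)) for c in out)


# Hypothetical matrix: double each channel and add a small offset.
# 3 rows x 4 columns = 12 free parameters.
BRIGHTEN = [
    [2.0, 0.0, 0.0, 0.05],
    [0.0, 2.0, 0.0, 0.05],
    [0.0, 0.0, 2.0, 0.05],
]

dark_pixel = (0.1, 0.2, 0.3)
print(apply_affine_color(dark_pixel, BRIGHTEN))
```

Because the same matrix is applied to every pixel, an operator of this kind can only change global brightness and color balance, which matches the abstract's split between global anchoring and local polishing.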

24. 【2603.15470】Automated Counting of Stacked Objects in Industrial Inspection

Link: https://arxiv.org/abs/2603.15470

Authors: Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu, Hieu Le, Pascal Fua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-throughput inventory tracking, fundamental computer vision, computer vision task, high-throughput inventory, assurance are critical

Comments: This preprint is a journal extension of our ICCV25 Oral paper: [this https URL](https://corentindumery.github.io/projects/stacks.html)

Abstract:Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.
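
The decomposition into stack geometry and occupancy ratio suggests a simple way the two estimates could combine into a count: the occupied volume divided by the volume of one part. The Python sketch below shows that arithmetic. The formula, the function name, and the numbers are assumptions made for illustration; the abstract does not state how the paper combines the two quantities.

```python
def estimate_count(stack_volume, occupancy_ratio, object_volume):
    """Estimate the number of identical objects in a stack.

    stack_volume:    volume enclosed by the stack (e.g. in cm^3)
    occupancy_ratio: fraction of that volume filled by objects, in [0, 1]
    object_volume:   volume of a single object, same unit as stack_volume
    """
    if not 0.0 <= occupancy_ratio <= 1.0:
        raise ValueError("occupancy_ratio must lie in [0, 1]")
    if object_volume <= 0:
        raise ValueError("object_volume must be positive")
    return round(stack_volume * occupancy_ratio / object_volume)


# Hypothetical bin: 120,000 cm^3 of stack, 60% filled, 240 cm^3 per part.
print(estimate_count(120_000, 0.6, 240))
```

With these made-up numbers the occupied volume is 72,000 cm^3, which at 240 cm^3 per part gives 300 parts.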

25. 【2603.15467】Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

Link: https://arxiv.org/abs/2603.15467

Authors: Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu, Xuechen Liu, Fuwen Luo, Peng Li, Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, recently made rapid, made rapid progress

Comments:

Abstract: Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information. Both are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.

26. 【2603.15436】MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts

Link: https://arxiv.org/abs/2603.15436

Authors: Zheng Zhang, Qinchuan Zhang, Yuteng Ye, Zhi Chen, Penglei Ji, Mengfei Li, Wenxiao Zhang, Yuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating high-quality textures, Generating high-quality, challenging task, Generating, multiview

Comments:

Abstract: Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from multiview inconsistency and missing textures on unseen parts, while UV inpainting methods do not generalize well due to insufficient UV data and cannot make good use of 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation with the inpainting ability of UV refinement to produce high-quality texture maps. Our key idea is to adopt a UV-space generative model that inpaints unseen parts of multiview images while simultaneously resolving their inconsistency. Experiments show that our method achieves better texture generation quality than existing methods, especially in unseen, occluded, and multiview-inconsistent parts.

27. 【2603.15433】Real-Time Human Frontal View Synthesis from a Single Image

Link: https://arxiv.org/abs/2603.15433

Authors: Fangyu Lin, Yingdong Hu, Lunjie Zhu, Zhening Liu, Yushi Huang, Zehong Lin, Jun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex multi-camera setups, democratizing immersive, multi-camera setups, Photorealistic human, frontal view synthesis

Comments:

Abstract:Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.

28. 【2603.15432】Gym-V: A Unified Vision Environment System for Agentic Vision Research

Link: https://arxiv.org/abs/2603.15432

Authors: Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems increasingly rely, agentic systems increasingly, verifiable rewards, rapid iteration, fair comparison

Comments:

Abstract: As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.

29. 【2603.15415】AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Link: https://arxiv.org/abs/2603.15415

Authors: Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma, Yujian Zheng, Jing Wang, Zheng Chong, Xujie Zhang, Xianhang Cheng, Xiaodan Liang, Hao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Controllable character animation, animation remains underexplored, multi-character animation remains, Controllable character, animation remains

Comments:

Abstract:Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...

30. 【2603.15404】Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context

Link: https://arxiv.org/abs/2603.15404

Authors: Mohamed Aziz Younes, Nicolas Saunier, Guillaume-Alexandre Bilodeau

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shared mobility, progressive automation, automation of transport, transport promises, promises to enhance

Comments: 10 pages, 6 figures

Abstract:The progressive automation of transport promises to enhance safety and sustainability through shared mobility. Like other vehicles and road users, and even more so for such a new technology, it requires monitoring to understand how it interacts in traffic and to evaluate its safety. This can be done with fixed cameras and video object detection. However, the addition of new detection targets generally requires a fine-tuning approach for regular detection methods. Unfortunately, this implementation strategy will lead to a phenomenon known as catastrophic forgetting, which causes a degradation in scene understanding. In road safety applications, preserving contextual scene knowledge is of the utmost importance for protecting road users. We introduce the Adaptive Residual Context (ARC) architecture to address this. ARC links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, utilizing attention to transfer spatial features while preserving pre-trained representations. Experiments on a custom dataset show that ARC matches fine-tuned baselines while significantly improving knowledge retention, offering a data-efficient solution to add new vehicle categories for complex urban environments.

31. 【2603.15403】Pointing-Based Object Recognition

Link: https://arxiv.org/abs/2603.15403

Authors: Lukáš Hajdúch, Viktor Kocur

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human pointing gestures, gestures using RGB, recognizing objects targeted, paper presents, presents a comprehensive

Comments: Submitted to InnovAIte conference

Abstract:This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.

32. 【2603.15396】AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Link: https://arxiv.org/abs/2603.15396

Authors: Noe Claudel, Weisi Guo, Yang Xing

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: impersonation attacks pose, critical risk, increasingly deployed, deployed in surveillance, pose a critical

Comments:

Abstract:Facial identification systems are increasingly deployed in surveillance, yet their vulnerability to adversarial evasion and impersonation attacks poses a critical risk. This paper introduces a novel framework for generating adversarial patches capable of both evasion and impersonation attacks against deep re-identification models across non-overlapping cameras. Unlike prior approaches that require iterative patch optimisation for each target, our method employs a conditional encoder-decoder network to synthesize adversarial patches in a single forward pass, guided by multi-scale features from source and target images. The patches are optimised with a dual adversarial objective comprising pull and push terms. To enhance imperceptibility and aid physical deployment, we further integrate naturalistic patch generation using pre-trained latent diffusion models. Experiments on standard pedestrian (Market-1501, DukeMTMC-reID) and facial recognition (CelebA-HQ, PubFig) benchmarks demonstrate the effectiveness of the proposed method. Our adversarial evasion attacks reduce mean Average Precision from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, showing strong cross-model generalization. In targeted impersonation attacks, our framework achieves a success rate of 27% on CelebA-HQ, competing with other patch-based methods. We further use clustering of activation maps to interpret which features are most used by adversarial attacks and propose a pathway for future countermeasures. The results highlight the practicality of adversarial patch attacks on retrieval-based systems and underline the urgent need for robust defense strategies.

33. 【2603.15386】RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Link: https://arxiv.org/abs/2603.15386

Authors: Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Language Models, Language Models, Visual Language, reasoning, main paradigm

Comments:

Abstract:Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than in previous work, by up to 16%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% and 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.

34. 【2603.15374】Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation

Link: https://arxiv.org/abs/2603.15374

Authors: Xiaoxian Zhang, Minghai Shi, Lei Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate monocular depth, monocular depth estimation, Accurate monocular, localization and navigation, monocular depth

Comments: 15 pages

Abstract:Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.
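
The absolute relative error quoted above is conventionally the mean of |predicted depth - true depth| / true depth over pixels with valid ground truth. A minimal sketch, assuming that standard definition (the abstract does not spell out the formula; the function name and sample depths are illustrative, not data from the paper):

```python
def abs_rel_error(pred, gt):
    """Mean absolute relative depth error over pixels with positive ground truth."""
    # Pixels without a valid (positive) ground-truth depth are skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Illustrative depths in arbitrary units (hypothetical values).
pred = [1.02, 2.10, 2.94, 4.00]
gt = [1.00, 2.00, 3.00, 0.00]  # the last pixel has no valid ground truth
error = abs_rel_error(pred, gt)
print(f"AbsRel = {error:.3f}")
```

Because the metric is a ratio, a score of 0.022 corresponds to an average depth error of roughly 2.2% of the true depth.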

35. 【2603.15370】Trajectory-Diversity-Driven Robust Vision-and-Language Navigation

Link: https://arxiv.org/abs/2603.15370

Authors: Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language instructions, navigate photo-realistic environments, Relative Policy Optimization, language instructions, Group Relative Policy

Comments: 17 pages, 5 figures

Abstract:Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
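
The within-group comparison described above is commonly implemented by standardizing each rollout's reward against the other rollouts sampled for the same instruction, which is why no separate value network is needed. A minimal sketch, assuming the usual Group Relative Policy Optimization normalization (subtract the group mean, divide by the group standard deviation); the reward values and function name are hypothetical, and NavGRPO's actual reward design is not given in the abstract:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical trajectories for one instruction: 1.0 = reached the goal.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Trajectories that beat the group average receive a positive advantage and the rest a negative one, so the baseline comes from the group itself.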

36. 【2603.15368】IRIS: Intersection-aware Ray-based Implicit Editable Scenes

Link: https://arxiv.org/abs/2603.15368

Authors: Grzegorz Wilczyński, Mikołaj Zieliński, Krzysztof Byrski, Joanna Waczyńska, Dominik Belter, Przemysław Spurek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Radiance Fields achieve, Neural Radiance Fields, strong empirical results, Gaussian splatting offers, Neural Radiance

Comments:

Abstract:Neural Radiance Fields achieve high-fidelity scene representation but suffer from costly training and rendering, while 3D Gaussian splatting offers real-time performance with strong empirical results. Recent solutions that harness the best of both worlds by using Gaussians as proxies to guide neural field evaluations still suffer from significant computational inefficiencies. They typically rely on stochastic volumetric sampling to aggregate features, which severely limits rendering performance. To address this issue, a novel framework named IRIS (Intersection-aware Ray-based Implicit Editable Scenes) is introduced as a method designed for efficient and interactive scene editing. To overcome the limitations of standard ray marching, an analytical sampling strategy is employed that precisely identifies interaction points between rays and scene primitives, effectively eliminating empty space processing. Furthermore, to address the computational bottleneck of spatial neighbor lookups, a continuous feature aggregation mechanism is introduced that operates directly along the ray. By interpolating latent attributes from sorted intersections, costly 3D searches are bypassed, which ensures geometric consistency and enables high-fidelity real-time rendering and flexible shape editing. Code can be found at this https URL.

37. 【2603.15365】A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression

Link: https://arxiv.org/abs/2603.15365

Authors: Yuming Han, Jooho Kim, Anish Shakya

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing remote sensing, Existing remote, remote sensing image, remote sensing, methods still explore

Comments:

Abstract:Existing remote sensing image compression methods still struggle to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with richer structural details, captured at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.

38. 【2603.15348】Oscillating Dispersion for Maximal Light-throughput Spectral Imaging

Link: https://arxiv.org/abs/2603.15348

Authors: Jiuyun Zhang, Zhan Shi, Linsen Chen, Xun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing computational spectral, systems typically rely, imaging systems typically, Dispersion Imaging Spectrometer, Existing computational

Comments:

Abstract:Existing computational spectral imaging systems typically rely on coded aperture and beam splitters that block a substantial fraction of incident light, degrading reconstruction quality under light-starved conditions. To address this limitation, we develop the Oscillating Dispersion Imaging Spectrometer (ODIS), which for the first time achieves near-full light throughput by axially translating a disperser between the conjugate image plane and a defocused position, sequentially capturing a panchromatic (PAN) image and a dispersed measurement along a single optical path. We further propose a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) that recovers high-fidelity spectral information from maskless dispersion under PAN structural guidance. Its data-fidelity step derives an FFT-Woodbury preconditioned solver by exploiting the cyclic-convolution property of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment using PAN features. Experiments show state-of-the-art performance on standard benchmarks, and cross-system comparisons confirm that ODIS yields decisive gains under low illumination. High-fidelity reconstruction is validated on a physical prototype.

39. 【2603.15330】MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2603.15330

Authors: Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, Yan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental task, fundamental capability, spatial intelligence, fundamental, Reconstruction

Comments:

Abstract:Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300--500 frame streams on 7-Scenes. The code is available at this https URL
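
The selective update described above can be illustrated on a toy state. This is a rough sketch under loud assumptions, not the authors' implementation: the abstract does not say how alignment is measured, so cosine similarity between each stored patch and its proposed replacement is assumed here, and `selective_update`, `k`, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def selective_update(state, candidate, k):
    """Overwrite only the k patches least aligned with their candidates."""
    scores = [cosine(s, c) for s, c in zip(state, candidate)]
    worst = sorted(range(len(state)), key=lambda i: scores[i])[:k]
    new_state = list(state)  # untouched patches are carried over exactly
    for i in worst:
        new_state[i] = candidate[i]
    return new_state, sorted(worst)


# Toy recurrent state with three memory patches (hypothetical values).
state = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.9]]
new_state, updated = selective_update(state, candidate, k=1)
print(updated, new_state)
```

The state keeps a fixed number of patches regardless of stream length, which is consistent with the constant inference memory the abstract claims.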

40. 【2603.15304】UE5-Forest: A Photorealistic Synthetic Stereo Dataset for UAV Forestry Depth Estimation

Link: https://arxiv.org/abs/2603.15304

Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous UAV-based pruning, complex canopy geometry, canopy geometry defeat, geometry defeat conventional, Dense ground-truth disparity

Comments:

Abstract:Dense ground-truth disparity maps are practically unobtainable in forestry environments, where thin overlapping branches and complex canopy geometry defeat conventional depth sensors -- a critical bottleneck for training supervised stereo matching networks for autonomous UAV-based pruning. We present UE5-Forest, a photorealistic synthetic stereo dataset built entirely in Unreal Engine 5 (UE5). One hundred and fifteen photogrammetry-scanned trees from the Quixel Megascans library are placed in virtual scenes and captured by a simulated stereo rig whose intrinsics -- 63 mm baseline, 2.8 mm focal length, 3.84 mm sensor width -- replicate the ZED Mini camera mounted on our drone. Orbiting each tree at up to 2 m across three elevation bands (horizontal, +45 degrees, -45 degrees) yields 5,520 rectified 1920 x 1080 stereo pairs with pixel-perfect disparity labels. We provide a statistical characterisation of the dataset -- covering disparity distributions, scene diversity, and visual fidelity -- and a qualitative comparison with real-world Canterbury Tree Branches imagery that confirms the photorealistic quality and geometric plausibility of the rendered data. The dataset will be publicly released to provide the community with a ready-to-use benchmark and training resource for stereo-based forestry depth estimation.
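
The rig parameters quoted above are enough to work out the expected disparity range. A minimal sketch, assuming an ideal rectified pinhole stereo model and that the 3.84 mm sensor width spans the full 1920-pixel image width (neither assumption is stated in the abstract; the function names are illustrative):

```python
def focal_length_px(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels."""
    return focal_mm / sensor_width_mm * image_width_px


def disparity_px(depth_m, baseline_m, focal_px):
    """Disparity of a point at the given depth for a rectified stereo pair."""
    return focal_px * baseline_m / depth_m


f_px = focal_length_px(2.8, 3.84, 1920)                        # about 1400 px
d_at_2m = disparity_px(2.0, baseline_m=0.063, focal_px=f_px)   # about 44.1 px
print(f"focal length = {f_px:.1f} px, disparity at 2 m = {d_at_2m:.1f} px")
```

Under these assumptions a branch at the maximum 2 m orbit distance has a disparity of about 44 pixels, and closer branches have proportionally larger disparities.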

41. 【2603.15302】Generative Video Compression with One-Dimensional Latent Representation

Link: https://arxiv.org/abs/2603.15302

Authors: Zihan Zheng, Zhaoyang Jia, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Zhenghao Chen, Houqiang Li, Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, employ high-capacity generative, high-capacity generative decoders, latent grid, generative video codec

Comments: CVPR2026

Abstract:Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset, surpassing the previous video compression this http URL: this https URL

42. 【2603.15300】GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection

Link: https://arxiv.org/abs/2603.15300

Authors: Aggelos Psiris, Yannis Panagakis, Maria Vakalopoulou, Georgios Th. Papadopoulos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated product inspection, product inspection systems, defect-free training samples, Industrial Visual Anomaly, Visual Anomaly Detection

Comments:

Abstract:Few-Shot Industrial Visual Anomaly Detection (FS-IVAD) is a critical task in modern manufacturing settings, where automated product inspection systems need to identify rare defects using only a handful of normal/defect-free training samples. In this context, the current study introduces a novel reconstruction-based approach termed GATE-AD. In particular, the proposed framework relies on a masked, representation-aligned Graph Attention Network (GAT) encoding scheme to learn robust appearance patterns of normal samples. By leveraging dense, patch-level, visual feature tokens as graph nodes, the model employs stacked self-attentional layers to adaptively encode complex, irregular, non-Euclidean, local relations. The graph is enhanced with a representation alignment component grounded on a learnable, latent space, where high reconstruction residual areas (i.e., defects) are assessed using a Scaled Cosine Error (SCE) objective function. Extensive comparative evaluation on the MVTec AD, VisA, and MPDD industrial defect detection benchmarks demonstrates that GATE-AD achieves state-of-the-art performance across the 1- to 8-shot settings, combining the highest detection accuracy (an increase of up to 1.8% in image AUROC in the 8-shot case on MPDD) with the lowest per-image inference latency (at least 25.05% faster), compared to the best-performing literature methods. In order to facilitate reproducibility and further research, the source code of GATE-AD is available at this https URL.
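
The Scaled Cosine Error named above is usually defined, following its use in masked graph autoencoders such as GraphMAE, as (1 - cosine similarity) raised to a power gamma >= 1. A minimal sketch, assuming GATE-AD uses that definition (the abstract gives neither the formula nor the exponent; `gamma=2.0` and the sample vectors are illustrative):

```python
import math


def scaled_cosine_error(x, z, gamma=2.0):
    """(1 - cos(x, z)) ** gamma for one feature vector and its reconstruction."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    cos = sum(a * b for a, b in zip(x, z)) / norm
    return (1.0 - cos) ** gamma


# Hypothetical patch features and reconstructions.
well_reconstructed = scaled_cosine_error([3.0, 4.0], [3.0, 4.0])
orthogonal = scaled_cosine_error([1.0, 0.0], [0.0, 1.0])
opposite = scaled_cosine_error([1.0, 0.0], [-1.0, 0.0])
print(well_reconstructed, orthogonal, opposite)
```

With gamma above 1, small residuals shrink toward zero while large ones keep or increase their weight, so poorly reconstructed patches (candidate defects) stand out.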

43. 【2603.15279】Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling

Link: https://arxiv.org/abs/2603.15279

Authors: Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conditional Flow Matching, continuous normalizing flows, Conditional Flow, Flow Matching, normalizing flows

Comments: Patched from ICLR2025. Code: [this https URL](https://github.com/araachie/loom-cfm)

Abstract:Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.
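
The per-minibatch baseline mentioned above (not LOOM-CFM itself) pairs each data sample with a noise sample so that the total transport cost inside the batch is minimal. A minimal sketch, assuming squared Euclidean cost and a brute-force search that is only practical for tiny batches; real implementations typically use an assignment solver such as `scipy.optimize.linear_sum_assignment`. The function names and toy points are hypothetical:

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_ot_pairing(data, noise):
    """Return p where data[i] is paired with noise[p[i]] at minimal total cost."""
    n = len(data)

    def total_cost(p):
        return sum(sq_dist(data[i], noise[p[i]]) for i in range(n))

    # Exhaustive search over all n! assignments; fine for n <= 8 or so.
    return list(min(permutations(range(n)), key=total_cost))


# Toy 2D minibatch (hypothetical values).
data = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
noise = [[9.0, 1.0], [1.0, -1.0], [4.0, 6.0]]
pairing = minibatch_ot_pairing(data, noise)
print(pairing)
```

Because the search sees only one batch at a time, the pairing is optimal locally but not across the dataset, which is the limitation the abstract says LOOM-CFM targets by carrying assignments across minibatches.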

44. 【2603.15276】Dataset Diversity Metrics and Impact on Classification Models

Link: https://arxiv.org/abs/2603.15276

Authors: Théo Sourget, Niclas Claßen, Jack Junchi Xu, Rob van der Goot, Veronika Cheplygina

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dataset diversity metrics, diversity metrics, important aspect, aspect to obtain, obtain a robust

Comments:

Abstract:The diversity of training datasets is usually perceived as an important factor in obtaining a robust model. However, diversity is often left undefined or defined differently across papers, and while some metrics exist, the quantification of this diversity is often overlooked when developing new algorithms. In this work, we study the behaviour of multiple dataset diversity metrics for image, text and metadata using MorphoMNIST, a toy dataset with controlled perturbations, and PadChest, a publicly available chest X-ray dataset. We evaluate whether these metrics correlate with each other but also with the intuition of a clinical expert. We also assess whether they correlate with downstream-task performance and how they impact the training dynamics of the models. We find limited correlations between the AUC and image or metadata reference-free diversity metrics, but higher correlations with the FID and the semantic diversity metrics. Finally, the clinical expert indicates that scanners are the main source of diversity in practice. However, we find that the addition of another scanner to the training set leads to shortcut learning. The code used in this study is available at this https URL

45. 【2603.15271】Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Link: https://arxiv.org/abs/2603.15271

Authors: Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: face substantial computational, substantial computational overhead, Native unified multimodal, Native unified, face substantial

Comments: Accepted by CVPR 2026 Findings

Abstract:Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78× to 2.01× inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at this https URL.

46. 【2603.15269】Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Link: https://arxiv.org/abs/2603.15269

Authors: Kim Ouan, Noémie Moreau, Katarzyna Bozek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: corneal nerve fibers, nerve fibers, corneal nerve, tortuosity heavily rely, tortuosity of corneal

Comments: 7 pages, 4 figures

Abstract:The tortuosity of corneal nerve fibers is used as an indicator of various diseases. Current state-of-the-art methods for grading the tortuosity rely heavily on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

47. 【2603.15267】Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels

Link: https://arxiv.org/abs/2603.15267

Authors: Victor Wåhlstrand, Jennifer Alvén, Ida Häggström

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: labels at inference, present a framework, order to improve, improve the performance, object detection

Comments: Submitted to MICCAI 2026

Abstract:We present a framework to take advantage of existing labels at inference, called exemplars, in order to improve the performance of object detection in medical images. The method, exemplar diffusion, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits are openly available online: this https URL

48. 【2603.15263】IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2603.15263

Authors: Konstantinos Almpanakis, Anna Kreshuk

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: capturing semantic features, Self-supervised learning, revolutionized representation learning, Joint-Embedding Architectures, semantic features

Comments:

Abstract:Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction -- via negative sampling or statistical regularization -- to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.

49. 【2603.15260】AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting

Link: https://arxiv.org/abs/2603.15260

Authors: Jing Wu, Yang Liu, Lin Zhang, Junbo Zeng, Jiabin Wang, Zi Ye, Guowen Li, Shilei Cao, Jiashun Cheng, Fang Wang, Meng Jin, Yerong Feng, Hong Cheng, Yutong Lu, Haohuan Fu, Juepeng Zheng

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserve coherent synoptic, coherent synoptic structures, Accurate weather forecasting, Accurate weather, small one-step errors

Comments:

Abstract:Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.

50. 【2603.15253】HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Link: https://arxiv.org/abs/2603.15253

Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: correctly align image, align image content, vision-language model ability, assesses a vision-language, align image

Comments

Click to view abstract

Abstract:Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at this https URL.

51. 【2603.15237】Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Link: https://arxiv.org/abs/2603.15237

Authors: Yao Gu, Xiaohao Xu, Yingna Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrate strong general-purpose, Vision-Language Models, physics-grounded anomaly detection, strong general-purpose reasoning, demonstrate strong

Comments: Accepted by IEEE ICASSP2026

Click to view abstract

Abstract:Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

52. 【2603.15228】HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Link: https://arxiv.org/abs/2603.15228

Authors: Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu, Xiao Zhang, Yang Li, Songtao Liu, Miles Yang, Yu Shi, Zhao Zhong, Liefeng Bo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Models struggle, Unified Multimodal Models, Multimodal Models, abstract representations needed, Models struggle

Comments: Work in progress: We are actively scaling up the models. More updates coming soon

Click to view abstract

Abstract:Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoders atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT built on the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art. It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.

53. 【2603.15213】Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift

Link: https://arxiv.org/abs/2603.15213

Authors: Wooseok Lee, Jin Mo Yang, Saewoong Bahk, Hyung-Sin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: OOD, deep-learning systems, reliable deployment, deployment of deep-learning, OOD samples

Comments

Click to view abstract

Abstract:For reliable deployment of deep-learning systems, out-of-distribution (OOD) detection is indispensable. In the real world, where test-time inputs often arrive as streaming mixtures of in-distribution (ID) and OOD samples under evolving covariate shifts, OOD samples are domain-constrained and bounded by the environment, and both ID and OOD are jointly affected by the same covariate factors. Existing methods typically assume a stationary ID distribution, but this assumption breaks down in such settings, leading to severe performance degradation. We empirically discover that, even under covariate shift, covariate-shifted ID (csID) and OOD (csOOD) samples remain separable along a discriminative axis in feature space. Building on this observation, we propose DART, a test-time, online OOD detection method that dynamically tracks dual prototypes -- one for ID and the other for OOD -- to recover the drifting discriminative axis, augmented with multi-layer fusion and flip correction for robustness. Extensive experiments on a wide range of challenging benchmarks, where all datasets are subjected to 15 common corruption types at severity level 5, demonstrate that our method significantly improves performance, yielding 15.32 percentage points (pp) AUROC gain and 49.15 pp FPR@95TPR reduction on ImageNet-C vs. Textures-C compared to established baselines. These results highlight the potential of the test-time discriminative axis tracking for dependable OOD detection in dynamically changing environments.
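
As a rough illustration of the dual-prototype idea in this abstract, the sketch below keeps one running prototype for ID features and one for OOD features, and scores a sample by its projection onto the axis between them. It is a simplified toy, not DART itself: the class name `DualPrototypeTracker`, the exponential-moving-average update, and the `momentum` parameter are all assumptions, and the multi-layer fusion and flip correction mentioned in the abstract are omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class DualPrototypeTracker:
    """Toy tracker (hypothetical): one prototype for ID, one for OOD."""

    def __init__(self, id_proto, ood_proto, momentum=0.5):
        self.id_proto = list(id_proto)
        self.ood_proto = list(ood_proto)
        self.momentum = momentum  # assumed EMA factor, not from the paper

    def score(self, feat):
        # Discriminative axis points from the OOD prototype to the ID prototype.
        axis = [i - o for i, o in zip(self.id_proto, self.ood_proto)]
        mid = [(i + o) / 2.0 for i, o in zip(self.id_proto, self.ood_proto)]
        # Positive score: the sample lies on the ID side of the midpoint.
        return dot([f - m for f, m in zip(feat, mid)], axis)

    def update(self, feat):
        # Assign the sample to the nearer side, then move that prototype.
        s = self.score(feat)
        proto = self.id_proto if s >= 0 else self.ood_proto
        for k in range(len(proto)):
            proto[k] = self.momentum * proto[k] + (1 - self.momentum) * feat[k]
        return s


# A covariate shift moves both groups by +2 along the second feature dimension.
tracker = DualPrototypeTracker(id_proto=[1.0, 0.0], ood_proto=[-1.0, 0.0])
tracker.update([1.0, 2.0])    # shifted ID-like sample
tracker.update([-1.0, 2.0])   # shifted OOD-like sample
```

After the two updates both prototypes have drifted toward the shifted data, so the axis between them still separates the two groups.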

54. 【2603.15206】Efficient Document Parsing via Parallel Token Prediction

Link: https://arxiv.org/abs/2603.15206

Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial vision task, vision task, fundamental yet crucial, crucial vision, revolutionized by vision-language

Comments: Accepted by CVPR 2026 Findings

Click to view abstract

Abstract:Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.

55. 【2603.15185】What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Link: https://arxiv.org/abs/2603.15185

Authors: David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, Bernt Schiele

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant attention, scale with data, gained significant, significant attention, potential to learn

Comments: To be published in CVPR Findings 2026

Click to view abstract

Abstract:End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: this https URL

56. 【2603.15168】Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning

Link: https://arxiv.org/abs/2603.15168

Authors: Ansar Rahman, Hassan Shojaee-Mend, Sepideh Hatamikia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Autism spectrum disorder, complex neurodevelopmental condition, neurodevelopmental condition characterized, Autism spectrum, subtle structural alterations

Comments: 29 Pages; 5 Figures

Click to view abstract

Abstract:Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by atypical functional brain connectivity and subtle structural alterations. rs-fMRI has been widely used to identify disruptions in large-scale brain networks, while structural MRI provides complementary information about morphological organization. Despite their complementary nature, effectively integrating these heterogeneous imaging modalities within a unified framework remains challenging. This study proposes a multimodal graph learning framework that preserves the dominant role of functional connectivity while integrating structural imaging and phenotypic information for ASD classification. The proposed framework is evaluated on the ABIDE-I dataset. Each subject is represented as a node within a population graph. Functional and structural features are extracted as modality-specific node attributes, while inter-subject relationships are modeled using a pairwise association encoder (PAE) based on phenotypic information. Two Edge Variational GCNs are trained to learn subject-level embeddings. To enable effective multimodal integration, we introduce a novel asymmetric transformer-based cross-attention mechanism that allows functional embeddings to selectively incorporate complementary structural information while preserving functional dominance. The fused embeddings are then passed to an MLP for ASD classification. Using stratified 10-fold cross-validation, the framework achieved an AUC of 87.3% and an accuracy of 84.4%. Under leave-one-site-out cross-validation (LOSO-CV), the model achieved an average cross-site accuracy of 82.0%, outperforming existing methods by approximately 3% under 10-fold cross-validation and 7% under LOSO-CV. The proposed framework effectively integrates heterogeneous multimodal data from the multi-site ABIDE-I dataset, improving automated ASD classification across imaging sites.

57. 【2603.15167】Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

Link: https://arxiv.org/abs/2603.15167

Authors: Sosuke Yamao, Natsuki Miyahara, Yuankai Qi, Shun Takeuchi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-term video understanding, large multimodal models, video understanding, Question-guided Multimodal Selective, Question-guided Visual Compression

Comments: Accepted to CVPR 2026. The first two authors contributed equally to this work

Click to view abstract

Abstract:In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedback-driven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be released publicly.

58. 【2603.15166】DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Link: https://arxiv.org/abs/2603.15166

Authors: Zhengxu He, Jun Li, Zhijian Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale Vision-Language Models, encode rich multimodal, rich multimodal semantics, Large-scale Vision-Language, encode rich

Comments

Click to view abstract

Abstract:Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation helps transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.

59. 【2603.15154】Vision-Language Model Based Multi-Expert Fusion for CT Image Classification

Link: https://arxiv.org/abs/2603.15154

Authors: Jianfa Bai, Kejin Lu, Runtian Yuan, Qingqiu Li, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hidden test-source identities, multi-institutional settings due, substantial source shift, test-source identities, chest CT remains

Comments

Click to view abstract

Abstract:Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. The Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.

60. 【2603.15153】TextOVSR: Text-Guided Real-World Opera Video Super-Resolution

Link: https://arxiv.org/abs/2603.15153

Authors: Hua Chang, Xin Xu, Wei Liu, Jiayi Wu, Kui Jiang, Fei Ma, Qi Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: early filming equipment, videos exhibit poor, classic opera videos, opera videos exhibit, exhibit poor visual

Comments

Click to view abstract

Abstract:Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at this https URL.

61. 【2603.15150】SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Link: https://arxiv.org/abs/2603.15150

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, Jason Kuen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, improves reconstruction fidelity, Cross Entropy Minimization, Stochastic Neighbor Cross, Neighbor Cross Entropy

Comments: 21 pages, 4 figures

Click to view abstract

Abstract:Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
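
To make the soft-target idea in this abstract concrete, here is a minimal sketch of replacing a one-hot target with a distribution over neighboring codes. The abstract only says that each token's probability is proportional to the proximity between its code embedding and the ground-truth embedding; the specific choices below (squared Euclidean distance, a softmax with a `temperature`, and a fixed neighbor count `k`) are assumptions for illustration, and the function names are hypothetical.

```python
import math


def snce_soft_target(gt_embedding, codebook, k=3, temperature=1.0):
    """Soft target over the k nearest codes (assumed: softmax of -distance^2)."""
    dists = [sum((c - g) ** 2 for c, g in zip(code, gt_embedding))
             for code in codebook]
    neighbors = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    logits = [-dists[i] / temperature for i in neighbors]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    target = [0.0] * len(codebook)
    for i, e in zip(neighbors, exps):
        target[i] = e / total
    return target


def soft_cross_entropy(log_probs, target):
    """Cross entropy between a soft target and the model's log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
target = snce_soft_target([0.1, 0.0], codebook, k=3)
uniform_log_probs = [math.log(0.25)] * 4
loss = soft_cross_entropy(uniform_log_probs, target)
```

With a one-hot target only the closest code would receive probability mass; here the two other nearby codes also get a share, while the distant code gets none.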

62. 【2603.15137】Context-Aware Sensor Modeling for Asynchronous Multi-Sensor Tracking in Stone Soup

Link: https://arxiv.org/abs/2603.15137

Authors: Martin Vonheim Larsen, Kim Mathiassen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real world involves, world involves asynchronous, Multi-sensor tracking, heterogeneous detection performance, real world

Comments

Click to view abstract

Abstract:Multi-sensor tracking in the real world involves asynchronous sensors with partial coverage and heterogeneous detection performance. Although probabilistic tracking methods permit detection probability and clutter intensity to depend on state and sensing context, many practical frameworks enforce globally uniform observability assumptions. Under multi-rate and partially overlapping sensing, this simplification causes repeated non-detections from high-rate sensors to erode tracks visible only to low-rate sensors, potentially degrading fusion performance. We introduce DetectorContext, an abstraction for the open-source multi-target tracking framework Stone Soup. DetectorContext exposes detection probability and clutter intensity as state-dependent functions evaluated during hypothesis formation. The abstraction integrates with existing probabilistic trackers without modifying their update equations. Experiments on asynchronous radar-lidar data demonstrate that context-aware modeling restores stable fusion and significantly improves HOTA and GOSPA performance without increasing false tracks.
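
The track-erosion effect described in this abstract can be illustrated with the standard Bernoulli missed-detection update, where a track's existence probability r becomes r(1 - p_D) / (1 - r p_D) after a scan with no detection. The sketch below is not Stone Soup code and does not use the DetectorContext API; the function names, the 0.9 detection probability, and the rectangular field of view are hypothetical values chosen for illustration.

```python
def existence_after_miss(r, p_d):
    """Bernoulli existence probability after one scan with no detection."""
    return r * (1.0 - p_d) / (1.0 - r * p_d)


def run_misses(r, p_d_of_state, state, n_scans):
    """Apply n_scans consecutive missed-detection updates."""
    for _ in range(n_scans):
        r = existence_after_miss(r, p_d_of_state(state))
    return r


def uniform_p_d(state):
    # Globally uniform observability: every target is assumed detectable.
    return 0.9


def field_of_view_p_d(state, x_min=0.0, x_max=50.0):
    # State-dependent detection probability: zero outside the sensor coverage.
    x, _y = state
    return 0.9 if x_min <= x <= x_max else 0.0


# A target outside the high-rate sensor's coverage, missed for 5 scans.
target_state = (80.0, 10.0)
r_uniform = run_misses(0.8, uniform_p_d, target_state, 5)
r_context = run_misses(0.8, field_of_view_p_d, target_state, 5)
```

Under the uniform assumption the existence probability collapses after a few scans, while the state-dependent model leaves it unchanged because the sensor could not have seen the target.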

63. 【2603.15132】WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Link: https://arxiv.org/abs/2603.15132

Authors: Hainuo Wang, Mingjia Li, Xiaojie Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent Flow Matching, Flow Matching models, Matching models avoid, manifold severely intertwines, Flow Matching

Comments

Click to view abstract

Abstract:While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at this https URL.

64. 【2603.15131】Low-light Image Enhancement with Retinex Decomposition in Latent Space

Link: https://arxiv.org/abs/2603.15131

Authors: Bolun Zheng, Qingshan Lei, Quan Chen, Qianyu Zhang, Kainan Yu, Xu Jia, Lingyu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inspiring numerous learning-based, Retinex theory, numerous learning-based methods, inspiring numerous, integrate its principles

Comments: Submitted to IEEE TIP

Click to view abstract

Abstract:Retinex theory provides a principled foundation for low-light image enhancement, inspiring numerous learning-based methods that integrate its principles. However, existing methods exhibit limitations in accurately decomposing reflectance and illumination components. To address this, we propose a Retinex-Guided Transformer (RGT) model, which is a two-stage model consisting of decomposition and enhancement phases. First, we propose a latent space decomposition strategy to separate reflectance and illumination components. By incorporating the log transformation and 1-pixel offset, we convert the intrinsically multiplicative relationship into an additive formulation, enhancing decomposition stability and precision. Subsequently, we construct a U-shaped component refiner incorporating the proposed guidance fusion transformer block. The component refiner refines the reflectance component to preserve texture details and optimize illumination distribution, effectively transforming low-light inputs to normal-light counterparts. Experimental evaluations across four benchmark datasets validate that our method achieves competitive performance in low-light enhancement and a more stable training process.

65. 【2603.15129】Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors

Link: https://arxiv.org/abs/2603.15129

Authors: Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He, Jincheng Dai, Li Song, Guo Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image compression, generative image compression, image diffusion-based ULB-IC, perceptual image compression

Comments

Click to view abstract

Abstract:We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as temporal priors: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to 5x. Code will be released later.

66. 【2603.15126】A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements

Link: https://arxiv.org/abs/2603.15126

Authors: Jan Andre Rudolph, Dennis Haitz, Markus Ulrich

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: hand-eye calibration method, ground-observing mobile robots, mobile robots, method for ground-observing, ground-observing mobile

Comments: 8 pages; accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

Click to view abstract

Abstract:A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are common, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera-based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.
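
The final step named in this abstract, computing the robot-camera transformation from the plate pose, the plate-camera pose, and the robot pose, amounts to chaining rigid transforms. The sketch below assumes a particular frame convention that the abstract does not spell out: the plate and robot poses are expressed in a common world (laser tracker) frame, the camera pose is expressed in the plate frame, and a pose is a (rotation matrix, translation) pair. All names and numbers are illustrative.

```python
def compose(a, b):
    """Return the rigid transform a * b, each given as (R, t)."""
    ra, ta = a
    rb, tb = b
    r = [[sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return r, t


def invert(a):
    """Inverse of a rigid transform: (R^T, -R^T t)."""
    r, t = a
    rt = [[r[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return rt, t_inv


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Assumed inputs (made-up values): poses of the robot and plate in the world
# frame, and the camera pose in the plate frame from the calibration target.
world_T_robot = (ROT_Z_90, [1.0, 0.0, 0.0])
world_T_plate = (IDENTITY, [3.0, 2.0, 0.0])
plate_T_camera = (IDENTITY, [0.0, 0.0, 0.5])

# robot_T_camera = inverse(world_T_robot) * world_T_plate * plate_T_camera
robot_T_camera = compose(compose(invert(world_T_robot), world_T_plate),
                         plate_T_camera)
```

In this example the camera sits at (3, 2, 0.5) in the world frame, which corresponds to (2, -2, 0.5) in the frame of a robot located at (1, 0, 0) and rotated 90 degrees about the vertical axis.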

67. 【2603.15119】A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation

Link: https://arxiv.org/abs/2603.15119

Authors: Nevrez Imamoglu, Ali Caglayan, Toru Kouyama

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthetic aperture radar, high noise levels, remains limited due, Masked auto-encoders, aperture radar

Comments: 10 pages, 8 figures, 1 Table

Click to view abstract

Abstract:Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM, a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to processing and preparing ALOS-2 observations into a dataset that can be used for self-supervised pretraining of models and for fine-tuning on downstream tasks such as semantic segmentation.

68. 【2603.15118】VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Link: https://arxiv.org/abs/2603.15118

Authors: Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: evaluating multimodal foundation, Reverse Annotation pipeline, multimodal foundation models, government forms, introduce VAREX

Comments: 9 pages, 4 figures, 4 tables, plus 12-page supplementary. Dataset: [this https URL](https://huggingface.co/datasets/ibm-research/VAREX) Code: [this https URL](https://github.com/udibarzi/varex-bench)

Abstract:We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models ≤4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.

69. 【2603.15110】Sampling-guided exploration of active feature selection policies

Link: https://arxiv.org/abs/2603.15110

Authors: Gabriel Bernardino, Anders Jonsson, Patrick Clarysse, Nicolas Duchateau

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: machine learning predictive, learning predictive models, predictive models, models is challenging, feature acquisition costs

Abstract:Determining the most appropriate features for machine learning predictive models is challenging regarding performance and feature acquisition costs. In particular, global feature choice is limited given that some features will only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach to sequentially recommend which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process where the state's dimensionality changes during the episode, avoiding data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.

70. 【2603.15109】PAKAN: Pixel Adaptive Kolmogorov-Arnold Network Modules for Pansharpening

Link: https://arxiv.org/abs/2603.15109

Authors: Haoyu Zhang, Haojing Chen, Zhen Zhong, Liangjian Deng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-resolution spatial details, rich spectral information, fuse high-resolution spatial, panchromatic images, multispectral images

Comments: 16 pages, 5 figures, 4 tables

Abstract:Pansharpening aims to fuse high-resolution spatial details from panchromatic images with the rich spectral information of multispectral images. Existing deep neural networks for this task typically rely on static activation functions, which limit their ability to dynamically model the complex, non-linear mappings required for optimal spatial-spectral fusion. While the recently introduced Kolmogorov-Arnold Network (KAN) utilizes learnable activation functions, traditional KANs lack dynamic adaptability during inference. To address this limitation, we propose a Pixel Adaptive Kolmogorov-Arnold Network framework. Starting from KAN, we design two adaptive variants: a 2D Adaptive KAN that generates spline summation weights across spatial dimensions and a 1D Adaptive KAN that generates them across spectral channels. These two components are then assembled into PAKAN 2to1 for feature fusion and PAKAN 1to1 for feature refinement. Extensive experiments demonstrate that our proposed modules significantly enhance network performance, proving the effectiveness and superiority of pixel-adaptive activation in pansharpening tasks.

71. 【2603.15100】Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC

Link: https://arxiv.org/abs/2603.15100

Authors: Alice Natalina Caragliano, Giulia Farina, Fatih Aksu, Camillo Maria Caruso, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Major pathological response, cell lung cancer, clinically meaningful endpoint, non-small cell lung, Major pathological

Abstract:Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.

72. 【2603.15083】ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Link: https://arxiv.org/abs/2603.15083

Authors: Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD)

Keywords: Listener Motion Generation, naturalistic listener body, generate naturalistic listener, Speaker Utterance, Motion Generation

Comments: 42 pages, 11 tables, 8 figures

Abstract:In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, which conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

73. 【2603.15062】The Good, the Better, and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning

Link: https://arxiv.org/abs/2603.15062

Authors: Ana Dias, João Ribeiro Pinto, Hugo Proença, João C. Neves

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robust performance remains, performance remains challenging, attributes, variations in age, facial attributes

Comments: Accepted at IWBF 2026

Abstract:Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.

74. 【2603.15050】SRL-MAD: Structured Residual Latents for One-Class Morphing Attack Detection

Link: https://arxiv.org/abs/2603.15050

Authors: Diogo J. Paulo, Hugo Proença, João C. Neves

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face morphing attacks, morphing attack detection, single face, morphing attacks represent, attack detection

Comments: Accepted at IWBF 2026

Abstract:Face morphing attacks represent a significant threat to biometric systems as they allow multiple identities to be combined into a single face. While supervised morphing attack detection (MAD) methods have shown promising performance, their reliance on attack-labeled data limits generalization to unseen morphing attacks. This has motivated increasing interest in one-class MAD, where models are trained exclusively on bona fide samples and are expected to detect unseen attacks as deviations from the normal facial structure. In this context, we introduce SRL-MAD, a one-class single-image MAD that uses structured residual Fourier representations for open-set morphing attack detection. Starting from a residual frequency map that suppresses image-specific spectral trends, we preserve the two-dimensional organization of the Fourier domain through a ring-based representation and replace azimuthal averaging with a learnable ring-wise spectral projection. To further encode domain knowledge about where morphing artifacts arise, we impose a frequency-informed inductive bias by organizing spectral evidence into low, mid, and high-frequency bands and learning cross-band interactions. These structured spectral features are mapped into a latent space designed for direct scoring, avoiding the reliance on reconstruction errors. Extensive evaluation on FERET-Morph, FRLL-Morph, and MorDIFF demonstrates that SRL-MAD consistently outperforms recent one-class and supervised MAD models. Overall, our results show that learning frequency-aware projections provides a more discriminative alternative to azimuthal spectral summarization for one-class morphing attack detection.
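
For reference, the azimuthal averaging that the abstract says SRL-MAD replaces can be sketched as below. This shows the conventional baseline only, not the authors' learnable ring-wise projection. The function name `azimuthal_average`, the ring-binning rule, and the toy spectrum are assumptions made for illustration.

```python
import math

def azimuthal_average(spectrum, num_rings):
    """Average a centred 2D magnitude spectrum over concentric rings.

    `spectrum` is a list of rows. The ring index grows with the distance
    from the centre, so ring 0 holds the lowest frequencies.
    """
    height, width = len(spectrum), len(spectrum[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    max_radius = math.hypot(cy, cx)
    sums = [0.0] * num_rings
    counts = [0] * num_rings
    for y in range(height):
        for x in range(width):
            radius = math.hypot(y - cy, x - cx)
            if max_radius == 0:
                ring = 0  # single-pixel input: everything is in ring 0
            else:
                ring = min(int(radius / max_radius * num_rings), num_rings - 1)
            sums[ring] += spectrum[y][x]
            counts[ring] += 1
    # Empty rings (possible for tiny inputs) are reported as 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy 4x4 spectrum (made-up values): low energy at the centre, high at the border.
toy_spectrum = [
    [3.0, 3.0, 3.0, 3.0],
    [3.0, 1.0, 1.0, 3.0],
    [3.0, 1.0, 1.0, 3.0],
    [3.0, 3.0, 3.0, 3.0],
]
ring_profile = azimuthal_average(toy_spectrum, num_rings=2)
```

Averaging over each ring collapses every direction at a given frequency into one number, which is the information loss a ring-wise projection with learned weights would try to avoid.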

75. 【2603.15039】GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Link: https://arxiv.org/abs/2603.15039

Authors: Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Chinese mobile GUI, mobile GUI agents, Multimodal Large

Comments: Accepted by CVPR 2026

Abstract:Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.

76. 【2603.15026】Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Link: https://arxiv.org/abs/2603.15026

Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: producing highly realistic, domain has surged, producing highly, controllable sequences, major advances

Comments: Accepted to CVPR 2026

Abstract:Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at this https URL.

77. 【2603.15025】One CT Unified Model Training Framework to Rule All Scanning Protocols

Link: https://arxiv.org/abs/2603.15025

Authors: Fengzhi Xu, Ziyuan Yang, Zexin Lu, Yingyu Chen, Fenglei Fan, Hongming Shan, Yi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Non-ideal measurement computed, measurement computed tomography, Non-ideal measurement, computed tomography, image quality

Abstract:Non-ideal measurement computed tomography (NICT), which lowers radiation at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise in NICT enhancement, most methods require paired data, which is an impractical demand due to inevitable organ motion. Unsupervised approaches attempt to overcome this limitation, but their assumption of homogeneous noise neglects the variability of scanning protocols, leading to poor generalization and potential model collapse. We further observe that distinct scanning protocols, which correspond to different physical imaging processes, produce discrete sub-manifolds in the feature space, contradicting these assumptions and limiting their effectiveness. To address this, we propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. A classifier in UMS identifies sub-manifolds and predicts uncertainty scores, which guide the generation of diverse samples across the entire manifold. By leveraging the classifier's capability, UMS effectively fills the gaps between discrete sub-manifolds, and promotes a continuous and dense feature space. Due to the complexity of the global manifold, it's hard to directly model it. Therefore, we propose to dynamically incorporate the global- and sub-manifold-specific features. Specifically, we design a global- and sub-manifold-driven architecture guided by the classifier, which enables dynamic adaptation to subdomain variations. This dynamic mechanism improves the network's capacity to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on public datasets are conducted to validate the effectiveness of our method across different generation paradigms.

78. 【2603.15020】MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal

Link: https://arxiv.org/abs/2603.15020

Authors: Yiqi Nie, Fei Wang, Junjie Chen, Kun Li, Yudi Cai, Dan Guo, Chenglong Li, Meng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: jointly convey nuanced, overlaid text jointly, text jointly convey, convey nuanced affect, tightly coupled

Abstract:Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: this https URL.

79. 【2603.15019】Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization

Link: https://arxiv.org/abs/2603.15019

Authors: Lehuai Xu, Weiming Zhang, Yang Li, Sidan Du, Lin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-fisheye stereo matching, Reliable omnidirectional depth, stereo matching, embodied robotics, Reliable omnidirectional

Comments: 8 pages, 5 figures

Abstract:Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing the global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it can aggregate pair-wise correlations into a robust, visibility-aware, and global consensus. As such, it is tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.

80. 【2603.15016】Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching

Link: https://arxiv.org/abs/2603.15016

Authors: Fangran Miao, Jian Huang, Ting Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Human motion generation, structured non-Euclidean geometry, follow structured non-Euclidean, Euclidean spaces, valid motions follow

Comments: 18 pages, 6 figures

Abstract:Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
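
As a rough illustration of geodesic interpolation with tangent-space supervision, the sketch below works on a unit sphere, where the geodesic has a closed form. It assumes a single spherical factor (for example, a unit quaternion for one rotation); the abstract does not describe RMG's manifold factors at this level of detail, and `sphere_geodesic` is a hypothetical name.

```python
import math

def sphere_geodesic(x0, x1, t, eps=1e-8):
    """Return the point at time t on the great-circle path from x0 to x1,
    together with its time derivative (a tangent vector at that point).

    x0 and x1 are unit vectors given as lists of floats; t lies in [0, 1].
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)  # geodesic distance between the endpoints
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        if dot < 0:
            raise ValueError("antipodal points: the geodesic is not unique")
        # Endpoints coincide: the path is constant and the velocity is zero.
        return list(x0), [0.0] * len(x0)
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    point = [w0 * a + w1 * b for a, b in zip(x0, x1)]
    # Differentiate the two weights with respect to t.
    dw0 = -theta * math.cos((1.0 - t) * theta) / sin_theta
    dw1 = theta * math.cos(t * theta) / sin_theta
    velocity = [dw0 * a + dw1 * b for a, b in zip(x0, x1)]
    return point, velocity

# Toy example: halfway between two orthogonal unit vectors.
point, velocity = sphere_geodesic([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], t=0.5)
```

In a flow-matching setup of this kind, `point` would be the training input at time `t` and `velocity` the regression target; because the target is tangent to the sphere, integrating it with a manifold-preserving step keeps samples on the manifold.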

81. 【2603.15011】Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Link: https://arxiv.org/abs/2603.15011

Authors: Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extracting chemical synthesis, chemical synthesis information, information from literature, critical for extracting, synthesis information

Abstract:Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.

82. 【2603.15008】Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Link: https://arxiv.org/abs/2603.15008

Authors: Kaixin zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal Large Language, Large Language Models, Video Question Answering, Multi-modal Large, Question Answering

Comments: 18 pages, 7 figures

Abstract:Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.

83. 【2603.15003】Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning

Link: https://arxiv.org/abs/2603.15003

Authors: Nasrin Rahimi, Mısra Yavuz, Burak Can Biner, Yunus Bilge Kurt, Ahmet Rasim Emirdağı, Süleyman Aslan, Görkay Aydemir, M. Akın Yılmaz, A. Murat Tekalp

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pre-trained image editing, object-aware transformation capabilities, transformation capabilities acquired, exhibit strong spatial, explicit temporal modeling

Abstract:Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model's inherent understanding of "how objects transform" in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.
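
For context, Low-Rank Adaptation keeps a pre-trained weight matrix W frozen and learns a low-rank update, so an adapted linear layer computes Wx + (alpha / r) · B(Ax). The sketch below shows that forward pass in plain Python with toy matrices. The helper names and values are hypothetical, and this is not the authors' training code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, w, a, b, alpha):
    """Apply a frozen weight w plus a low-rank update b @ a scaled by alpha / r.

    Shapes: w is d_out x d_in, a is r x d_in, b is d_out x r.
    """
    rank = len(a)
    base = matvec(w, x)               # frozen pre-trained path
    update = matvec(b, matvec(a, x))  # trainable low-rank path
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]

# Toy layer with d_in = d_out = 2 and rank r = 1 (made-up values).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b_zero = [[0.0], [0.0]]     # common initialisation: the update starts at zero
b_trained = [[1.0], [0.0]]  # stand-in for a value reached after fine-tuning
x = [1.0, 2.0]

before = lora_forward(x, w, a, b_zero, alpha=2.0)
after = lora_forward(x, w, a, b_trained, alpha=2.0)
```

Only `a` and `b` would be trained, which is why a few dozen to a few hundred samples can be enough to steer a large frozen model.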

84. 【2603.14998】Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3

Link: https://arxiv.org/abs/2603.14998

Authors: Hürkan Şahin, Huy Xuan Pham, Van Huyen Dang, Alper Yegenoglu, Erdal Kayacan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: unmanned aerial vehicles, visually degraded environments, Autonomous navigation, degraded environments remains, environments remains challenging

Comments: 8 pages, 8 figures, 2 tables

Abstract:Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
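
The absolute relative error quoted above is commonly defined as the mean of |predicted − true| / true over pixels with valid ground truth. A minimal sketch, assuming that standard definition is the one the abstract reports; the function name and sample depths are illustrative and not taken from the paper.

```python
def abs_rel_error(predicted, ground_truth):
    """Mean of |pred - gt| / gt over pixels whose ground-truth depth is positive."""
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in valid) / len(valid)

# Made-up depths in metres; the third pixel has no ground truth and is skipped.
predicted = [1.9, 4.2, 0.0, 10.5]
ground_truth = [2.0, 4.0, 0.0, 10.0]
score = abs_rel_error(predicted, ground_truth)
```

Because each error is divided by the true depth, a 0.5 m miss at 10 m counts the same as a 0.1 m miss at 2 m.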

85. 【2603.14989】MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Link: https://arxiv.org/abs/2603.14989

Authors: Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han, Zhongwei Wan, Ziheng Zhang, Jingxuan Zhang, Jing Xiong, Ziyuan Liu, Yifan Zhang, Hangrui Cao, Chenyang Zhao, Mi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long multimodal contexts, high inference latency, inference latency due, suffer from high, high inference

Abstract:Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.

86. 【2603.14976】Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Link: https://arxiv.org/abs/2603.14976

Authors: Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang, Naixiang Zheng, Guoyuan Wang, Jun Yu, Jiaen Liang, Wei Huang, Shengping Liu, Ximin Zheng

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Estimating Emotional Mimicry, Emotional Mimicry Intensity, Mimicry Intensity, Mimicry Intensity estimation, Estimating Emotional

Abstract:Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
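
The evaluation metric named in this abstract, a mean Pearson correlation coefficient across emotional dimensions, can be sketched as follows. This is an illustrative computation only: the dimension names `dim_a` and `dim_b` and all values are hypothetical placeholders (the abstract does not list the six dimensions), and the code is not the competition's official scorer.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(preds, targets):
    """Average the per-dimension correlations; both arguments map dimension -> values."""
    return sum(pearson(preds[k], targets[k]) for k in targets) / len(targets)


# Hypothetical predictions and labels for two placeholder dimensions.
preds = {"dim_a": [0.1, 0.2, 0.3], "dim_b": [0.3, 0.2, 0.1]}
targets = {"dim_a": [0.2, 0.4, 0.6], "dim_b": [0.1, 0.2, 0.3]}

score = mean_pearson(preds, targets)
print(f"mean Pearson = {score:.3f}")
```

Averaging per-dimension correlations means one poorly predicted dimension lowers the overall score even when the others are predicted well.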

87. 【2603.14974】Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition

Link: https://arxiv.org/abs/2603.14974

Authors: Jaein Kim, Hee Bin Yoo, Dong-Sig Han, Byoung-Tak Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: LiDAR Place Recognition, Place Recognition, aggregating local descriptors, pooling layer plays, LiDAR Place

Comments: Accepted at ICRA 26

Abstract:The pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order methods in LPR adhere to conventional implementations and post-normalization, which leaves the descriptor unsuitable for Euclidean distance comparison. Based on the recent interpretation that associates NetVLAD with second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through experiments on the Oxford Robotcar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
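
The claim that whitening lets a plain Euclidean distance "implicitly measure the Mahalanobis distance" rests on a general identity, sketched below on toy 2-D vectors. This is not the paper's whitening algorithm (which operates on second-order descriptors and is not specified in the abstract); the covariance matrix, the vectors, and all function names are hypothetical.

```python
import math


def cholesky_2x2(cov):
    """Return (l11, l21, l22) of the lower-triangular L with L @ L.T == cov."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22


def whiten(x, cov):
    """Compute z = L^{-1} x by forward substitution."""
    l11, l21, l22 = cholesky_2x2(cov)
    z1 = x[0] / l11
    z2 = (x[1] - l21 * z1) / l22
    return (z1, z2)


def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)^T cov^{-1} (x - y) for a 2x2 covariance."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = x[0] - y[0], x[1] - y[1]
    return (c * d0 * d0 - 2 * b * d0 * d1 + a * d1 * d1) / det


def euclidean_sq(u, v):
    """Squared Euclidean distance between two 2-D points."""
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2


cov = ((4.0, 1.2), (1.2, 1.0))  # hypothetical symmetric positive-definite covariance
x, y = (1.0, 2.0), (3.0, -1.0)  # hypothetical descriptors

d_whitened = euclidean_sq(whiten(x, cov), whiten(y, cov))
d_mahalanobis = mahalanobis_sq(x, y, cov)
print(d_whitened, d_mahalanobis)
```

The two printed values agree up to floating-point rounding, which is why a whitened descriptor can be compared with an ordinary nearest-neighbour search in Euclidean space.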

88. 【2603.14965】GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Link: https://arxiv.org/abs/2603.14965

Authors: Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon, Kuk-Jin Yoon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesis requires strong, view synthesis requires, generate visually coherent, visually coherent images, requires strong

Comments: The code will be available at [this https URL](https://sites.google.com/view/minjun-kang/geonvs-arxiv26)

Abstract:Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.

89. 【2603.14957】CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

Link: https://arxiv.org/abs/2603.14957

Authors: Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand, Bingnan Li, Haiyang Xu, Zhuowen Tu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single autoregressive framework, autoregressive framework, foundation model capable, single autoregressive, unified vision-language foundation

Abstract:We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image-layout-image and layout-image-layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.

90. 【2603.14953】Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

Link: https://arxiv.org/abs/2603.14953

Authors: Minchan Kwon, Hyounguk Shon, Junmo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large multimodal models, video remains challenging, recently demonstrated remarkable, demonstrated remarkable performance, remains challenging due

Abstract:Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.

91. 【2603.14952】Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset

Link: https://arxiv.org/abs/2603.14952

Authors: Songcheng Du, Yang Zou, Jiaxin Li, Mingxuan Liu, Ying Li, Changjing Shang, Qiang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rarely addressed task, cloud-induced spectral distortions, simultaneous spatial resolution, spatial resolution degradation, Thin Cloud Removal

Comments: 11 pages, 5 figures, published in AAAI 2026

Abstract:Pansharpening under thin cloudy conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.

92. 【2603.14951】GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM

Link: https://arxiv.org/abs/2603.14951

Authors: Guohua Zhang, Jian Jin, Meiqin Liu, Chao Yao, Weisi Lin, Yao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multi-modal Large Language, Image Quality Assessment, Language Models, Multi-modal Large

Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to point cloud quality assessment (PCQA) remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.

93. 【2603.14948】Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

Link: https://arxiv.org/abs/2603.14948

Authors: Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: raw sensor input, Driving world models, World Model, Driving world, Driving World Model

Comments: 16 pages, 9 figures. The code is available at [this https URL](https://github.com/TabGuigui/WorldDrive)

Abstract:End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.

94. 【2603.14938】FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

Link: https://arxiv.org/abs/2603.14938

Authors: Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu, Yufei Ma, Muyang Sun, Heyu Si, Qi Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems remain fundamentally, remain fundamentally constrained, driving systems remain, autonomous driving, interactive simulation environments

Abstract:Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.

95. 【2603.14936】Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework

Link: https://arxiv.org/abs/2603.14936

Authors: Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, remarkable success, achieved remarkable, RFD, diffusion models

Abstract:Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.

96. 【2603.14935】Video-CoE: Reinforcing Video Event Prediction via Chain of Events

Link: https://arxiv.org/abs/2603.14935

Authors: Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VEP task, remains relatively underexplored, VEP, future events

Comments: 21 pages, 18 figures, 6 tables

Abstract:Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly force the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Codes and models will be released soon.

97. 【2603.14925】Workflow-Aware Structured Layer Decomposition for Illustration Production

Link: https://arxiv.org/abs/2603.14925

Authors: Tianyu Zhang, Dongchi Li, Keiichi Sawada, Haoran Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Recent generative image, Recent generative, improve controllability, typically relying, object-based segmentation

Comments: 17 pages, 15 figures

Abstract:Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulates the standard anime production workflow. Experiments demonstrate that our method achieves accurate and visually coherent layer decompositions. We believe that the resulting layered representation further enables downstream tasks such as recoloring and texture embedding, supporting content creation and illustration editing. Code is available at: this https URL

98. 【2603.14920】$\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling

Link: https://arxiv.org/abs/2603.14920

Authors: Huanjing Yue, Dawei Li, Shaoxiong Tu, Jingyu Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High Dynamic Range, Low Dynamic Range, Reconstructing High Dynamic, alternating-exposure Low Dynamic, Reconstructing High

Comments: Accepted by CVPR 2026

Abstract:Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that $\text{F}^2\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.

99. 【2603.14916】EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing

Link: https://arxiv.org/abs/2603.14916

Authors: Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, Ke Gu, Jian Zhang, Shusong Xu, Jinwei Chen, Bo Li, Guangtao Zhai

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: achieved remarkable progress, Recent text-guided image, image editing, Recent text-guided, unaesthetic contents

Abstract:Recent text-guided image editing (TIE) models have achieved remarkable progress, yet many edited images still suffer from issues such as artifacts, unexpected edits, and unaesthetic content. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address these challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback for image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale up image editing. Both the dataset and code will be released in our GitHub repository: this https URL.

100. 【2603.14915】ILV: Iterative Latent Volumes for Fast and Accurate Sparse-View CT Reconstruction

Link: https://arxiv.org/abs/2603.14915

Authors: Seungryong Lee, Woojeong Baek, Joosang Lee, Eunbyung Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lowering system cost, reducing radiation exposure, enabling timely imaging, timely imaging, radiation exposure

Comments: Project page: [this https URL](https://sngryonglee.github.io/ILV/)

Abstract:A long-term goal in CT imaging is to achieve fast and accurate 3D reconstruction from sparse-view projections, thereby reducing radiation exposure, lowering system cost, and enabling timely imaging in clinical workflows. Recent feed-forward approaches have shown strong potential toward this overarching goal, yet their results still suffer from artifacts and loss of fine details. In this work, we introduce Iterative Latent Volumes (ILV), a feed-forward framework that integrates data-driven priors with classical iterative reconstruction principles to overcome key limitations of prior feed-forward models in sparse-view CBCT reconstruction. At its core, ILV constructs an explicit 3D latent volume that is repeatedly updated by conditioning on multi-view X-ray features and the learned anatomical prior, enabling the recovery of fine structural details beyond the reach of prior feed-forward models. In addition, we develop and incorporate several key architectural components, including an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation, that efficiently realize its core latent volume refinement concept. Extensive experiments on a large-scale dataset of approximately 14,000 CT volumes demonstrate that ILV significantly outperforms existing feed-forward and optimization-based methods in both reconstruction quality and speed. These results show that ILV enables fast and accurate sparse-view CBCT reconstruction suitable for clinical use. The project page is available at: this https URL.

101. 【2603.14909】TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking

Link: https://arxiv.org/abs/2603.14909

Authors: Yaoyu Liu, Minghui Zhang, Junjun He, Yun Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automatic extraction, clinical applications, Automatic, vessel, vessel skeletons

Comments: 10 pages, 9 figures. Under Review

Abstract:Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitous vessel skeleton tracker. TopoVST constructs multi-scale sphere graphs to sample the input image and employs graph neural networks to jointly estimate tracking directions and vessel radii. The utilization of multi-scale representations is enhanced through a gating-based feature fusion mechanism, while the issue of class imbalance during training is mitigated by embedding a geometry-aware weighting scheme into the directional loss. In addition, we design a wave-propagation-based skeleton tracking algorithm that explicitly mitigates the generation of spurious skeletons through space-occupancy filtering. We evaluate TopoVST on two vessel datasets with different geometries. Extensive comparisons with state-of-the-art baselines demonstrate that TopoVST achieves competitive performance in both overlapping and topological metrics. Our source code is available at: this https URL.

102. 【2603.14908】PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning

Link: https://arxiv.org/abs/2603.14908

Authors: Yinfeng Gao, Qichao Zhang, Deqing Liu, Zhongpu Xia, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Imitation Learning, real driving requirements, open-loop training objectives, inadequate open-loop training, autonomous driving policies

Comments: Accepted by IEEE RA-L. Submitted: 2025.12.2; Revised: 2026.2.4; Accepted: 2026.3.7

Abstract:End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.

103. 【2603.14892】Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs

链接https://arxiv.org/abs/2603.14892

作者:Jaehoon Lee,Mingi Jung,Soohyuk Jang,Seungryong Yoo,Dahuin Jung,Sungroh Yoon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, resulting large number, major computational bottleneck, multimodal understanding capabilities, Vision-Language Models

备注

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.

104. 【2603.14886】PASTE: Physics-Aware Scattering Topology Embedding Framework for SAR Object Detection

链接https://arxiv.org/abs/2603.14886

作者:Jiacheng Chen,Yuxuan Xiong,Haipeng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthetic Aperture Radar, Current deep learning-based, Aperture Radar, Synthetic Aperture, deep learning-based object

备注

点击查看摘要

Abstract:Current deep learning-based object detection for Synthetic Aperture Radar (SAR) imagery mainly adopts optical image methods, treating targets as texture patches while ignoring inherent electromagnetic scattering mechanisms. Though scattering points have been studied to boost detection performance, most methods still rely on amplitude-based statistical models. Some approaches introduce frequency-domain information for scattering center extraction, but they suffer from high computation cost and poor compatibility with diverse datasets. Thus, effectively embedding scattering topological information into modern detection frameworks remains challenging. To solve these problems, this paper proposes the Physics-Aware Scattering Topology Embedding Framework (PASTE), a novel closed-loop architecture for comprehensive scattering prior integration. By building the full pipeline from topology generation and injection to joint supervision, PASTE integrates scattering physics into modern SAR detectors. Specifically, it designs a scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model to produce scalable and physically consistent priors. A scattering topology injection module guides multi-scale feature learning, and a scattering prior supervision strategy constrains network optimization by aligning predictions with scattering center distributions. Experiments on real datasets show that PASTE is compatible with various detectors and brings relative mAP gains of 2.9% to 11.3% over baselines with acceptable computation overhead. Visualization of scattering maps verifies that PASTE successfully embeds scattering topological priors into feature space, clearly distinguishing target and background scattering regions, thus providing strong interpretability for results.

105. 【2603.14885】SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

链接https://arxiv.org/abs/2603.14885

作者:Huanjing Yue,Shangbin Xie,Cong Cao,Qian Wu,Lei Zhang,Lei Zhao,Jingyu Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging imaging conditions, rich scene information, scene information compared, RAW images preserve, images preserve superior

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code and model are available at this https URL.

106. 【2603.14882】LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

链接https://arxiv.org/abs/2603.14882

作者:Soumyaratna Debnath,Bui Duc Manh,Zinan Liu,Lin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dedicating equal precision, Vision-Language Models, uniform spatial fidelity, typically assume, dedicating equal

备注: CVPR 2026, 10 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.

107. 【2603.14880】RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

链接https://arxiv.org/abs/2603.14880

作者:Linfei Li,Lin Zhang,Ying Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:establish semantic correspondences, Visual-language grounding aims, localize target objects, target objects based, visual entities

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at this https URL.

108. 【2603.14861】Video Detector: A Dual-Phase Vision-Based System for Real-Time Traffic Intersection Control and Intelligent Transportation Analysis

链接https://arxiv.org/abs/2603.14861

作者:Mustafa Fatih Şen,Halûk Gümüşkaya,Şenol Pazar

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:costly infrastructure modifications, increasingly requires intelligent, requires intelligent sensing, Urban traffic management, management increasingly requires

备注: 18 pages, 10 figures, 4 tables, preprint, the dataset is openly available

点击查看摘要

Abstract:Urban traffic management increasingly requires intelligent sensing systems capable of adapting to dynamic traffic conditions without costly infrastructure modifications. Vision-based vehicle detection has therefore become a key technology for modern intelligent transportation systems. This study presents Video Detector (VD), a dual-phase vision-based traffic intersection management system designed as a flexible and cost-effective alternative to traditional inductive loop detectors. The framework integrates a real-time module (VD-RT) for intersection control with an offline analytical module (VD-Offline) for detailed traffic behavior analysis. Three system configurations were implemented using SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN, trained on datasets totaling 108,000 annotated images across 6-10 vehicle classes. Experimental results show detection performance of up to 90% test accuracy and 29.5 mAP@0.5, while maintaining real-time throughput of 37 FPS on HD video streams. Field deployments conducted in collaboration with Istanbul IT and Smart City Technologies Inc. (ISBAK) demonstrate stable operation under diverse environmental conditions. The system supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multiclass vehicle classification, enabling comprehensive intersection monitoring without the need for embedded road sensors. The annotated dataset and training pipeline are publicly released to support reproducibility. These results indicate that the proposed framework provides a scalable and deployable vision-based solution for intelligent transportation systems and smart-city traffic management.

109. 【2603.14856】From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness

链接https://arxiv.org/abs/2603.14856

作者:Chenlin Fu,Ao Gong,Yingying Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Horizontal Bounding Boxes, Rotated Bounding Boxes, Bounding Boxes, aims to precisely, precisely determine

备注

点击查看摘要

Abstract:Cross-View object geo-localization (CVOGL) aims to precisely determine the geographic coordinates of a query object from a ground or drone perspective by referencing a satellite map. Segmentation-based approaches offer high precision but require prohibitively expensive pixel-level annotations, whereas more economical detection-based methods suffer from lower accuracy. This performance disparity in detection is primarily caused by two factors: the poor geometric fit of Horizontal Bounding Boxes (HBoxes) for oriented objects and the degradation in precision due to feature map scaling. Motivated by these, we propose leveraging Rotated Bounding Boxes (RBoxes) as a natural extension of the detection-based paradigm. RBoxes provide a much tighter geometric fit to oriented objects. Building on this, we introduce OSGeo, a novel geo-localization framework, meticulously designed with a multi-scale perception module and an orientation-sensitive head to accurately regress RBoxes. To support this scheme, we also construct and release CVOGL-R, the first dataset with precise RBox annotations for CVOGL. Extensive experiments demonstrate that our OSGeo achieves state-of-the-art performance, consistently matching or even surpassing the accuracy of leading segmentation-based methods but with an annotation cost that is over an order of magnitude lower.
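
The geometric argument for RBoxes can be checked with a small calculation. The sketch below is not the OSGeo implementation; the function names and the 10 x 2 object rotated by 45 degrees are illustrative assumptions. It computes the corners of a rotated box, the axis-aligned box needed to enclose it, and the fraction of that horizontal box the object actually covers.

```python
import math

def rbox_corners(cx, cy, w, h, angle_rad):
    """Corners of a w x h box centred at (cx, cy), rotated by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

def enclosing_hbox(corners):
    """Smallest axis-aligned box (x0, y0, x1, y1) containing all corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# Hypothetical elongated object: 10 x 2 units, rotated by 45 degrees.
width, height = 10.0, 2.0
corners = rbox_corners(0.0, 0.0, width, height, math.pi / 4)
x0, y0, x1, y1 = enclosing_hbox(corners)

rbox_area = width * height
hbox_area = (x1 - x0) * (y1 - y0)
fill_ratio = rbox_area / hbox_area  # share of the HBox covered by the object
```

For this toy object the horizontal box has area 72 while the object covers only 20, so under these assumptions roughly 72% of the HBox is background.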

110. 【2603.14851】AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

链接https://arxiv.org/abs/2603.14851

作者:Wenhui Huang,Songyan Zhang,Qihang Huang,Zhidong Wang,Zhiqi Mao,Collister Chua,Zhan Chen,Long Chen,Chen Lv

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Integrating vision-language models, Integrating vision-language, systems has shown, shown promise, promise in improving

备注

点击查看摘要

Abstract:Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the Project Page (this https URL).

111. 【2603.14850】From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration

链接https://arxiv.org/abs/2603.14850

作者:Ziwei Wei,Yao Shen,Wanheng Lu,Ghim Wei Ho,Kaiyang Zeng

类目:Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall)

关键词:Scanning Probe Microscopy, Scanning Probe, Probe Microscopy, offers nanoscale resolution, gain induced noise

备注: 37 pages, 7 figures, 7 tables, journal paper

点击查看摘要

Abstract:Scanning Probe Microscopy or SPM offers nanoscale resolution but is frequently marred by structured artefacts such as line scan dropout, gain induced noise, tip convolution, and phase hops. While most available methods treat SPM artefact removal as isolated denoising or interpolation tasks, the generative inpainting perspective remains largely unexplored. In this work, we introduce a diffusion based inpainting framework tailored to scientific grayscale imagery. By fine tuning less than 0.2 percent of BrushNet weights with rank constrained low rank adaptation (LoRA), we adapt a pretrained diffusion model using only 7390 (artefact, clean) pairs distilled from 739 experimental scans. On our forthcoming public SPM InpBench benchmark, the LoRA enhanced model lifts the Peak Signal to Noise Ratio or PSNR by 6.61 dB and halves the Learned Perceptual Image Patch Similarity or LPIPS relative to zero-shot inference, while matching or slightly surpassing the accuracy of full retraining and being trainable on a single GPU instead of four high-memory cards. The approach generalizes across various SPM image channels including height, amplitude and phase, faithfully restores subtle structural details, and suppresses hallucination artefacts inherited from natural image priors. This lightweight framework enables efficient, scalable recovery of irreplaceable SPM images and paves the way for a broader diffusion model adoption in nanoscopic imaging analysis.
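
To see why LoRA trains so few weights, the following sketch uses the generic LoRA formulation, in which a frozen weight matrix W is augmented by a low-rank product B·A. It is not the paper's BrushNet configuration: the layer size (4096 x 4096), the rank (4), and the function names are illustrative assumptions, and the fraction is computed for one layer rather than the whole model.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters relative to the frozen weights of one layer."""
    frozen = d_out * d_in               # parameters of the frozen matrix W
    trainable = rank * (d_out + d_in)   # parameters of B (d_out x r) and A (r x d_in)
    return trainable / frozen

def lora_effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) using plain nested lists."""
    rows, cols, rank = len(W), len(W[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(cols)]
        for i in range(rows)
    ]

# Hypothetical layer: 4096 x 4096 frozen weights adapted with rank 4.
fraction = lora_trainable_fraction(4096, 4096, 4)

# Tiny rank-1 example of the adapted weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0]]
W_eff = lora_effective_weight(W, B, A)
```

With these assumed sizes the adapter adds about 0.195% as many parameters as the frozen layer holds, which is the same order of magnitude as the figure quoted in the abstract.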

112. 【2603.14848】Personalized Federated Learning with Residual Fisher Information for Medical Image Segmentation

链接https://arxiv.org/abs/2603.14848

作者:Meilu Zhu,Yuxing Li,Zhiwei Wang,Edmund Y. Lam

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:collaboratively train machine, Federated learning enables, learning enables multiple, train machine learning, Federated learning

备注: accepted by ISBI 2026

点击查看摘要

Abstract:Federated learning enables multiple clients (institutions) to collaboratively train machine learning models without sharing their private data. To address the challenge of data heterogeneity across clients, personalized federated learning (pFL) aims to learn customized models for each client. In this work, we propose pFL-ResFIM, a novel pFL framework that achieves client-adaptive personalization at the parameter level. Specifically, we introduce a new metric, Residual Fisher Information Matrix (ResFIM), to quantify the sensitivity of model parameters to domain discrepancies. To estimate ResFIM for each client model under privacy constraints, we employ a spectral transfer strategy that generates simulated data reflecting the domain styles of different clients. Based on the estimated ResFIM, we partition model parameters into domain-sensitive and domain-invariant components. A personalized model for each client is then constructed by aggregating only the domain-invariant parameters on the server. Extensive experiments on public datasets demonstrate that pFL-ResFIM consistently outperforms state-of-the-art methods, validating its effectiveness.
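
The aggregation step described above can be sketched as follows. This is an illustration under assumptions, not the pFL-ResFIM code: parameters are flat lists, the invariant/sensitive mask is given rather than estimated from ResFIM, and server aggregation is a plain unweighted mean.

```python
def personalize(client_params, all_clients_params, invariant_mask):
    """Build one client's personalized parameters.

    Entries marked domain-invariant are replaced by the mean across clients;
    entries marked domain-sensitive keep the client's local value.
    """
    n_clients = len(all_clients_params)
    personalized = []
    for i, (local_value, is_invariant) in enumerate(zip(client_params, invariant_mask)):
        if is_invariant:
            mean_value = sum(params[i] for params in all_clients_params) / n_clients
            personalized.append(mean_value)
        else:
            personalized.append(local_value)
    return personalized

# Two hypothetical clients with three parameters each.
clients = [
    [1.0, 10.0, 3.0],
    [3.0, 20.0, 5.0],
]
# Assumed mask for client 0: parameters 0 and 2 are domain-invariant.
mask_client0 = [True, False, True]
result = personalize(clients[0], clients, mask_client0)
```

Here client 0 receives the averaged values 2.0 and 4.0 for its invariant parameters and keeps its local value 10.0 for the sensitive one.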

113. 【2603.14837】DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery

链接https://arxiv.org/abs/2603.14837

作者:Yifan Yang,Lei Zou,Wenjing Gong,Kani Fu,Zongrong Li,Siqin Wang,Bing Zhou,Heng Cai,Hao Tian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Analyzing street-view imagery, Analyzing street-view, computer vision models, Contrastive Language-Image Pre-training, hyperlocal damage assessment

备注

点击查看摘要

Abstract:Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven Arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, compared to visual models relying solely on images, DamageArbiter, through arbitration of discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models, especially in situations where disaster visual cues are ambiguous or subject to interference, reducing overconfident but incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.

114. 【2603.14832】Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis

链接https://arxiv.org/abs/2603.14832

作者:Tuan-Anh Yang,Bao V. Q. Bui,Chanh-Quang Vo-Van,Truong-Son Hy

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:capture complementary slice-level, deep learning framework, Variance Risk Extrapolation, propose a deep, chest CT scans

备注

点击查看摘要

Abstract:We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at this https URL
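
Logit-level ensembling can be illustrated with a minimal sketch. The abstract does not state how the two branches are weighted, so the equal weighting, the example logits, and the function names below are assumptions rather than the authors' settings.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_logits(logits_a, logits_b, weight_a=0.5):
    """Weighted average of two branches' logits (equal weights assumed)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(logits_a, logits_b)]

# Hypothetical binary-classification logits from the two branches.
logits_2p5d = [2.0, 0.5]   # 2.5D branch
logits_3d = [0.2, 1.1]     # 3D branch

fused = ensemble_logits(logits_2p5d, logits_3d)
probs = softmax(fused)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the branches disagree (the 3D branch alone would pick class 1), and the fused logits favour class 0 because the 2.5D branch is more confident.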

115. 【2603.14827】SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space

链接https://arxiv.org/abs/2603.14827

作者:Zejian Kang,Kai Zheng,Yuanchen Fei,Wentao Yang,Hongyuan Zou,Xiangru Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Facial action estimation, explicit semantic interpretability, compact expression spaces, Facial action, lack explicit semantic

备注

点击查看摘要

Abstract:Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.

116. 【2603.14825】Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

链接https://arxiv.org/abs/2603.14825

作者:Yewon Han,Yumin Seol,EunGyung Kong,Minsoo Jo,Taesup Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:general visual-grounded reasoning, Large Vision-Language Models, visual-grounded reasoning tasks, Existing jailbreak defence, Large Language Model

备注

点击查看摘要

Abstract:Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality induced bias direction consistently observed across datasets, which arises from suboptimal coupling between the Large Language Model backbone and visual encoders. We further demonstrate that this direction undermines performance on both tasks. Leveraging this insight, we propose Two Birds, One Projection, an efficient inference time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method effectively breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
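
Projecting a feature onto the null space of a single direction amounts to subtracting its component along that direction. The sketch below shows only this linear-algebra step; how the bias direction is identified and where in the model the projection is applied are not covered, and the vectors and function name are illustrative assumptions.

```python
def project_out(feature, direction):
    """Remove the component of `feature` that lies along `direction`.

    Computes h - (h . d / d . d) * d, so the result is orthogonal to d.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = sum(f * d for f, d in zip(feature, direction)) / norm_sq
    return [f - coeff * d for f, d in zip(feature, direction)]

# Hypothetical 3-dimensional feature and bias direction.
feature = [3.0, 4.0, 1.0]
bias_direction = [1.0, 0.0, 0.0]

cleaned = project_out(feature, bias_direction)
residual = sum(c * d for c, d in zip(cleaned, bias_direction))  # should be 0
```

The operation is a single dot product and a scaled subtraction per feature vector, which is consistent with the abstract's claim that the defence fits in one forward pass.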

117. 【2603.14822】RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving

链接https://arxiv.org/abs/2603.14822

作者:Yue Sun,Yeqiang Qian,Zhe Wang,Tianhui Li,Chunxiang Wang,Ming Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diverse real-world traffic, autonomous driving systems, real-world traffic conditions, Reliable perception, essential for autonomous

备注

点击查看摘要

Abstract:Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The "X" highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.

118. 【2603.14819】RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

链接https://arxiv.org/abs/2603.14819

作者:Ravi Ranjan,Utkarsh Grover,Xiaomin Lin,Agoritsa Polyzou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieved remarkable success, efficiently removing undesirable, One-step Optimized Retentive, Transformer based diffusion, Optimized Retentive unlearning

备注: 18 pages, 6 figures, 8 tables, accepted to CVPR 2026; to appear in the Findings Track Proceedings of the IEEE/CVF Conference

点击查看摘要

Abstract:Transformer based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significantly faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.

119. 【2603.14816】M2IR: Proactive All-in-One Image Restoration via Mamba-style Modulation and Mixture-of-Experts

链接https://arxiv.org/abs/2603.14816

作者:Shiwei Wang,Yongzhen Wang,Bingwen Hu,Liyan Zhang,Xiao-Ping Zhang,Mingqiang Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain fundamentally reactive, dominated recent advances, Transformer-based architectures, fundamentally reactive, architectures have dominated

备注

点击查看摘要

Abstract:While Transformer-based architectures have dominated recent advances in all-in-one image restoration, they remain fundamentally reactive: propagating degradations rather than proactively suppressing them. In the absence of explicit suppression mechanisms, degraded signals interfere with feature learning, compelling the decoder to balance artifact removal and detail preservation, thereby increasing model complexity and limiting adaptability. To address these challenges, we propose M2IR, a novel restoration framework that proactively regulates degradation propagation during the encoding stage and efficiently eliminates residual degradations during decoding. Specifically, the Mamba-Style Transformer (MST) block performs pixel-wise selective state modulation to mitigate degradations while preserving structural integrity. In parallel, the Adaptive Degradation Expert Collaboration (ADEC) module utilizes degradation-specific experts guided by a DA-CLIP-driven router and complemented by a shared expert to eliminate residual degradations through targeted and cooperative restoration. By integrating the MST block and ADEC module, M2IR transitions from passive reaction to active degradation control, effectively harnessing learned representations to achieve superior generalization, enhanced adaptability, and refined recovery of fine-grained details across diverse all-in-one image restoration benchmarks. Our source codes are available at this https URL.

120. 【2603.14811】Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Link: https://arxiv.org/abs/2603.14811

Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied multi-agent systems, multi-agent systems, fundamental challenge, partial viewpoints, world from distributed

Abstract:Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
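The abstract names Group-Relative Policy Optimization as the reinforcement learning stage. The sketch below illustrates only the generic group-relative idea: each sampled response is scored against the mean and spread of its own group. It is not the CoRL training code and does not model the CVSR reward. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed small constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average receive a positive advantage and the others a negative one, so no separate value network is needed to form the baseline.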

121. 【2603.14807】HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

Link: https://arxiv.org/abs/2603.14807

Authors: Kailin Lyu, Kangyi Wu, Pengna Li, Xiuyu Hu, Qingyi Si, Cui Miao, Ning Yang, Zihang Wang, Long Xiao, Lianyu Hu, Jingyuan Sun, Ce Hao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: demonstrated impressive zero-shot, demonstrated impressive, impressive zero-shot performance, VLN, impressive zero-shot

Comments: 9 pages, 7 figures

Abstract:LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at this https URL.

122. 【2603.14796】Global Truncated Loss Minimization for Robust and Threshold-Resilient Geometric Estimation

Link: https://arxiv.org/abs/2603.14796

Authors: Tianyu Huang, Liangzu Peng, Xinyue Zhang, Tongfan Guan, Jinhu Dong, Haoang Li, Laurent Kneip, Yun-Hui Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: influence of outliers, generally employed, employed to mitigate, mitigate the influence, GTM

Comments: 19 pages, 10 figures

Abstract:To achieve outlier-robust geometric estimation, robust objective functions are generally employed to mitigate the influence of outliers. The widely used consensus maximization (CM) is highly robust when paired with global branch-and-bound (BnB) search. However, CM relies solely on inlier counts and is sensitive to the inlier threshold. Besides, the discrete nature of CM leads to loose bounds, necessitating extensive BnB iterations and computation cost. Truncated losses (TL), a continuous alternative, leverage residual information more effectively and could potentially overcome these issues. But to our knowledge, no prior work has systematically explored globally minimizing TL with BnB and its potential for enhanced threshold resilience or search efficiency. In this work, we propose GTM, the first unified BnB-based framework for globally optimal TL minimization across diverse geometric problems. GTM involves a hybrid solving design: given an n-dimensional problem, it performs BnB search over an (n-1)-dimensional subspace while the remaining 1D variable is solved by bounding the objective function. Our hybrid design not only reduces the search space, but also enables us to derive Lipschitz-continuous bounding functions that are general, tight, and can be efficiently solved by a classic global Lipschitz solver named DIRECT, which brings further acceleration. We conduct a systematic evaluation of various BnB-based methods for CM and TL on the robust linear regression problem, showing that GTM enjoys remarkable threshold resilience and the highest efficiency compared to baseline methods. Furthermore, we apply GTM to different geometric estimation problems with diverse residual forms. Extensive experiments demonstrate that GTM achieves state-of-the-art outlier-robustness and threshold-resilience while maintaining high efficiency across these estimation tasks.
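To make the contrast between consensus maximization and truncated losses concrete, here is a minimal sketch of the two objectives evaluated on fixed residuals. It assumes a truncated quadratic as the TL form, which may differ from the loss used in the paper, and it does not implement the BnB search or GTM itself. The function names and residual values are hypothetical.

```python
def consensus_count(residuals, threshold):
    """CM objective: number of residuals within the inlier threshold."""
    return sum(1 for r in residuals if abs(r) <= threshold)


def truncated_loss(residuals, threshold):
    """Assumed TL form: squared residual, capped at threshold squared."""
    cap = threshold * threshold
    return sum(min(r * r, cap) for r in residuals)


# Two hypothetical candidate models with the same three inliers and one outlier.
tight_fit = [0.1, 0.1, 0.1, 5.0]
loose_fit = [0.9, 0.9, 0.9, 5.0]
threshold = 1.0

cm_tight = consensus_count(tight_fit, threshold)
cm_loose = consensus_count(loose_fit, threshold)
tl_tight = truncated_loss(tight_fit, threshold)
tl_loose = truncated_loss(loose_fit, threshold)
```

Both candidates have the same inlier count, so CM cannot rank them, while the truncated loss prefers the tighter fit because it still uses the size of the inlier residuals. The outlier contributes the same capped value to both.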

123. 【2603.14794】Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

Link: https://arxiv.org/abs/2603.14794

Authors: Ernie Chu, Vishal M. Patel

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: delivering short monologues, conversation remains difficult, portray isolated speakers, isolated speakers delivering, speakers delivering short

Abstract:Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.

124. 【2603.14790】Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making

Link: https://arxiv.org/abs/2603.14790

Authors: Shufeng Nan, Mengtian Li, Sixiao Zheng, Yuwei Lu, Han Zhang, Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: film production team, collaborative decision-making process, multi-modal agent-driven framework, production team, film production

Comments: 10 pages, 4 figures

Abstract:We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.

125. 【2603.14781】High-Fidelity 3D Facial Avatar Synthesis with Controllable Fine-Grained Expressions

Link: https://arxiv.org/abs/2603.14781

Authors: Yikang He, Jichao Zhang, Wei Wang, Nicu Sebe, Yao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: types based, Dual Mappers, Facial expression editing, control facial expressions, expression

Abstract:Facial expression editing methods can be mainly categorized into two types based on their architectures: 2D-based and 3D-based methods. The former lacks 3D face modeling capabilities, making it difficult to edit 3D factors effectively. The latter has demonstrated superior performance in generating high-quality and view-consistent renderings using single-view 2D face images. Although these methods have successfully used animatable models to control facial expressions, they still have limitations in achieving precise control over fine-grained expressions. To address this issue, in this paper, we propose a novel approach by simultaneously refining both the latent code of a pretrained 3D-Aware GAN model for texture editing and the expression code of the driven 3DMM model for mesh editing. Specifically, we introduce a Dual Mappers module, comprising Texture Mapper and Emotion Mapper, to learn the transformations of the given latent code for textures and the expression code for meshes, respectively. To optimize the Dual Mappers, we propose a Text-Guided Optimization method, leveraging a CLIP-based objective function with expression text prompts as targets, while integrating a SubSpace Projection mechanism to project the text embedding to the expression subspace such that we can have more precise control over fine-grained expressions. Extensive experiments and comparative analyses demonstrate the effectiveness and superiority of our proposed method.

126. 【2603.14772】Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image

Link: https://arxiv.org/abs/2603.14772

Authors: Joohyun Kwon, Geonhee Sim, Gyeongsik Moon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rigid joint transformations, methods primarily rely, realistic cloth dynamics, joint transformations, limiting their ability

Comments: Accepted to CVPR 2026

Abstract:Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.

127. 【2603.14770】AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

Link: https://arxiv.org/abs/2603.14770

Authors: Longhui Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-person identity-preserving generation, identity-preserving generation requires, generation requires binding, requires binding multiple, binding multiple reference

Abstract:Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, showing great potential application value.

128. 【2603.14765】SSR: A Training-Free Approach for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2603.14765

Authors: Hui Deng, Yuxin Mao, Yuxin He, Yuchao Dai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strict latency constraints, stateful recurrent models, demands long-horizon state, reconstruction demands long-horizon, long-horizon state updates

Comments: 8 pages

Abstract:Streaming 3D reconstruction demands long-horizon state updates under strict latency constraints, yet stateful recurrent models often suffer from geometric drift as errors accumulate over time. We revisit this problem from a Grassmannian manifold perspective: the latent persistent state can be viewed as a subspace representation, i.e., a point evolving on a Grassmannian manifold, where temporal coherence implies the state trajectory should remain on (or near) this manifold. Based on this view, we propose Self-expressive Sequence Regularization (SSR), a plug-and-play, training-free operator that enforces Grassmannian sequence regularity during inference. Given a window of historical states, SSR computes an analytical affinity matrix via the self-expressive property and uses it to regularize the current update, effectively pulling noisy predictions back toward the manifold-consistent trajectory with minimal overhead. Experiments on long-sequence benchmarks demonstrate that SSR consistently reduces drift and improves reconstruction quality across multiple streaming 3D reconstruction tasks.

129. 【2603.14764】Topology-Preserving Data Augmentation for Ring-Type Polygon Annotations

Link: https://arxiv.org/abs/2603.14764

Authors: Sudip Laudari, Sang Hun Baek

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Geometric data augmentation, annotations represent simply, represent simply connected, Geometric data, simply connected regions

Comments: 10 pages, 6 figures

Abstract:Geometric data augmentation is widely used in segmentation pipelines and typically assumes that polygon annotations represent simply connected regions. However, in structured domains such as architectural floorplan analysis, ring-type regions are often encoded as a single cyclic polygon chain connecting outer and inner boundaries. During augmentation, clipping operations may remove intermediate vertices and disrupt this cyclic connectivity, breaking the structural relationship between the boundaries. In this work, we introduce an order-preserving polygon augmentation strategy that performs transformations in mask space and then projects surviving vertices back into index-space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead. Experiments demonstrate that the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.
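As a loose illustration of the order-preserving idea, the sketch below tags each vertex with its original index, drops vertices that fall outside the image after a transform, and rebuilds cyclic adjacency from the surviving indices. The abstract does not give the authors' algorithm in detail, so this is a simplified stand-in: it works on vertex coordinates rather than masks, and every name in it is hypothetical.

```python
def surviving_vertices(vertices, transform, width, height):
    """Apply `transform` to each vertex and keep those still inside the image.

    Each survivor is returned with its original index, so the traversal
    order of the source polygon is never lost.
    """
    kept = []
    for index, vertex in enumerate(vertices):
        x, y = transform(vertex)
        if 0 <= x < width and 0 <= y < height:
            kept.append((index, (x, y)))
    return kept


def cyclic_edges(kept):
    """Reconnect survivors in original index order, closing the chain."""
    count = len(kept)
    return [(kept[i][0], kept[(i + 1) % count][0]) for i in range(count)]


# Hypothetical square shifted right by 2 px in a 5x5 image.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
kept = surviving_vertices(square, lambda v: (v[0] + 2, v[1]), 5, 5)
edges = cyclic_edges(kept)
```

Because adjacency is rebuilt from the original indices rather than from whatever order a clipping routine returns, the repaired chain follows the source polygon's traversal order even when intermediate vertices are lost.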

130. 【2603.14763】LiDAR-EVS: Enhance Extrapolated View Synthesis for 3D Gaussian Splatting with Pseudo-LiDAR Supervision

Link: https://arxiv.org/abs/2603.14763

Authors: Yiming Huang, Xin Kang, Sipeng Zhang, Hongliang Ren, Weihua Zhang, Junjie Lai

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, powerful technique, technique for real-time, autonomous driving, Gaussian

Comments: 22 pages, 8 figures

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time LiDAR and camera synthesis in autonomous driving simulation. However, simulating LiDAR with 3DGS remains challenging for extrapolated views beyond the training trajectory, as existing methods, typically trained on single-traversal sensor scans, suffer from severe overfitting and poor generalization to novel ego-vehicle paths. To enable reliable simulation of LiDAR along unseen driving trajectories without external multi-pass data, we present LiDAR-EVS, a lightweight framework for robust extrapolated-view LiDAR simulation in autonomous driving. Designed to be plug-and-play, LiDAR-EVS readily extends to diverse LiDAR sensors and neural rendering baselines with minimal modification. Our framework comprises two key components: (1) pseudo extrapolated-view point cloud supervision with multi-frame LiDAR fusion, view transformation, occlusion curling, and intensity adjustment; (2) spatially-constrained dropout regularization that promotes robustness to diverse trajectory variations encountered in real-world driving. Extensive experiments demonstrate that LiDAR-EVS achieves SOTA performance on extrapolated-view LiDAR synthesis across three datasets, making it a promising tool for data-driven simulation, closed-loop evaluation, and synthetic data generation in autonomous driving systems.

131. 【2603.14750】Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

Link: https://arxiv.org/abs/2603.14750

Authors: Cailing Han, Zhangbin Li, Jinxing Zhou, Wei Qian, Jingjing Hu, Yanghao Zhou, Zhangling Duan, Dan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detect sentiment-relevant segments, Boundary Enhancement Network, weakly-supervised temporal sentiment, temporal sentiment localization, Face-guided Sentiment Boundary

Abstract:Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To further tackle the challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues. We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.

132. 【2603.14741】PHAC: Promptable Human Amodal Completion

Link: https://arxiv.org/abs/2603.14741

Authors: Seung Young Noh, Ju Yong Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conditional image generation, offer users limited, users limited control, Conditional image, human-centric applications

Comments: Accepted to CVPR 2026

Abstract:Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.

133. 【2603.14739】TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective

Link: https://arxiv.org/abs/2603.14739

Authors: Yusheng Peng, Gaofeng Zhang, Liping Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: robot navigation, egocentric perspective, autonomous driving, driving and robot, Future trajectory prediction

Comments: Accepted by ICRA 2026

Abstract:Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. First, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion-guided Mamba decoder explicitly models the relative motion between the pedestrian and the vehicle, integrating pedestrian motion features as historical context with ego-motion features as guiding cues to produce decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.

134. 【2603.14738】Efficient Event Camera Volume System

Link: https://arxiv.org/abs/2603.14738

Authors: Juan Camilo Soto, Ian Noronha, Saru Bharti, Upinder Kaur

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: high dynamic range, cameras promise low, Camera Volume System, sparse output challenges, output challenges integration

Comments: Accepted to ICRA 2026

Abstract:Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity with DTFT delivering the lowest earth mover distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment with DCT processing achieving 1.5 ms latency and 2.7× higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
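The phrase "direct transform evaluation at event timestamps" can be illustrated with the Fourier case. If an event stream is modeled as a train of Dirac impulses, s(t) = Σ_k p_k δ(t − t_k), its Fourier transform is S(ω) = Σ_k p_k e^{−jω t_k}, which can be computed from the raw timestamps without binning events into time slots. The sketch below evaluates that sum; it is not the EECVS implementation, it omits transform selection and coefficient pruning, and the function name and sample values are hypothetical.

```python
import cmath
import math


def impulse_train_spectrum(timestamps, polarities, omegas):
    """Evaluate S(w) = sum_k p_k * exp(-j * w * t_k) at each frequency w.

    Timestamps are used exactly as recorded, so no temporal binning
    (and none of its quantization error) is involved.
    """
    return [
        sum(p * cmath.exp(-1j * w * t) for t, p in zip(timestamps, polarities))
        for w in omegas
    ]


# Hypothetical events at one pixel: (time in seconds, polarity).
timestamps = [0.0, 0.5, 1.0]
polarities = [1, -1, 1]
spectrum = impulse_train_spectrum(timestamps, polarities, [0.0, 2 * math.pi])
```

At ω = 0 the result is simply the sum of polarities, and at ω = 2π the alternating polarities line up in phase, which shows how a few coefficients can summarize the timing pattern of a sparse stream.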

135. 【2603.14733】A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Link: https://arxiv.org/abs/2603.14733

Authors: Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models

Abstract:Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.

136. 【2603.14727】Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach

Link: https://arxiv.org/abs/2603.14727

Authors: Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: retinopathy screening traditionally, screening traditionally relies, requiring specialized equipment, Diabetic retinopathy screening, resource limited settings

Abstract:Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative that uses standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model performance. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1). Notably, the model attained near-perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.

137. 【2603.14726】Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

Link: https://arxiv.org/abs/2603.14726

Authors: Gyeongsik Moon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurately recovering hand, Accurately recovering, body context remains, whole-body pose estimation, recovering hand poses

Comments: Accepted to CVPR 2026

Abstract:Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.

138. 【2603.14707】Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Link: https://arxiv.org/abs/2603.14707

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: graphical user interfaces, Computer-using agents, act directly, user interfaces, directly on graphical

Comments:

Abstract:Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent's reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Model, benchmark, and code are available at this https URL.

139. 【2603.14706】AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Link: https://arxiv.org/abs/2603.14706

Authors: Salim Khazem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Vision Transformers faces, fixed feature extractor, Vision Transformers, setting adapter capacity, under-addressed issues

Comments:

Abstract:Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: this https URL
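
The zero-initialized up-projection described in the abstract can be illustrated with a small sketch. This is not the authors' code: the class name `ZeroInitAdapter`, the ReLU nonlinearity, the initialization scale, and the toy dimensions are all assumptions made for illustration. It only shows why a residual bottleneck with an all-zero up-projection leaves the input features unchanged at initialization.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ZeroInitAdapter:
    """Hypothetical residual low-rank bottleneck: x + up(relu(down(x)))."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x dim): small random values (assumed scale).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Up-projection (dim x rank): all zeros, so the residual branch
        # contributes nothing until training updates these weights.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]  # ReLU is an assumption
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]


features = [0.5, -1.0, 2.0, 0.25]
adapter = ZeroInitAdapter(dim=4, rank=2)
adapted = adapter(features)
assert adapted == features  # identical to the frozen features at initialization
```

Because `up` starts at zero, the adapted block reproduces its input exactly at step 0, which corresponds to the "starts exactly at the pretrained function" property the abstract claims.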

140. 【2603.14704】Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning

Link: https://arxiv.org/abs/2603.14704

Authors: Ping Chen, Xiang Liu, Xingpeng Zhang, Fei Shen, Xun Gong, Zhaoxiang Liu, Zezhou Chen, Huan Hu, Kai Wang, Shiguo Lian

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: content-agnostic sampling schedule, reflexive System, Diffusion models operate, framework enabling System, content-agnostic sampling

Comments: 12 figures, 5 tables

Abstract:Diffusion models operate in a reflexive System 1 mode, constrained by a fixed, content-agnostic sampling schedule. This rigidity arises from the curse of state dimensionality, where the combinatorial explosion of possible states in the high-dimensional noise manifold renders explicit trajectory planning intractable and leads to systematic computational misallocation. To address this, we introduce Chain-of-Trajectories (CoTj), a train-free framework enabling System 2 deliberative planning. Central to CoTj is Diffusion DNA, a low-dimensional signature that quantifies per-stage denoising difficulty and serves as a proxy for the high-dimensional state space, allowing us to reformulate sampling as graph planning on a directed acyclic graph. Through a Predict-Plan-Execute paradigm, CoTj dynamically allocates computational effort to the most challenging generative phases. Experiments across multiple generative models demonstrate that CoTj discovers context-aware trajectories, improving output quality and stability while reducing redundant computation. This work establishes a new foundation for resource-aware, planning-based diffusion modeling. The code is available at this https URL.

141. 【2603.14702】Fractal Autoregressive Depth Estimation with Continuous Token Diffusion

Link: https://arxiv.org/abs/2603.14702

Authors: Jinchang Zhang, Xinrou Kang, Guoyu Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular depth estimation, gap between RGB, Monocular depth, inefficient pixel-wise generation, Visual Autoregressive Diffusion

Comments:

Abstract:Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.

142. 【2603.14701】AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild

Link: https://arxiv.org/abs/2603.14701

Authors: Yiting Wang, Tim Brödermann, Hamed Haghighi, Haonan Zhao, Christos Sakaridis, Kurt Debattista, Valentina Donzella

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing RGB-LiDAR fusion, RGB-LiDAR fusion methods, fusion methods degrade, methods degrade significantly, LiDAR measurements suffer

Comments:

Abstract:Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over 82K weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.

143. 【2603.14694】Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation

Link: https://arxiv.org/abs/2603.14694

Authors: Asmae Mouradi, Shruti Kshirsagar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Rapid structural damage, remote sensing imagery, Rapid structural, timely disaster response, remote sensing

Comments:

Abstract:Rapid structural damage assessment from remote sensing imagery is essential for timely disaster response. Within human-machine systems (HMS) for disaster management, automated damage detection provides decision-makers with actionable situational awareness. However, models trained on multi-disaster benchmarks often underperform in unseen geographic regions due to domain shift - a distributional mismatch between training and deployment data that undermines human trust in automated assessments. We explore a two-stage ensemble approach using supervised domain adaptation (SDA) for building damage classification across four severity classes. The pipeline adapts the xView2 first-place method to the Ida-BD dataset using SDA and systematically investigates the effect of individual augmentation components on classification performance. Comprehensive ablation experiments on the unseen Ida-BD test split demonstrate that SDA is indispensable: removing it causes damage detection to fail entirely. Our pipeline achieves the most robust performance using SDA with unsharp-enhanced RGB input, attaining a Macro-F1 of 0.5552. These results underscore the critical role of domain adaptation in building trustworthy automated damage assessment modules for HMS-integrated disaster response.

144. 【2603.14686】MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

Link: https://arxiv.org/abs/2603.14686

Authors: Jinguang Tong, Jinbo Wu, Kaisiyuan Wang, Zhelun Shen, Xuan Huang, Mochu Xiang, Xuesong Li, Yingying Li, Haocheng Feng, Chen Zhao, Hang Zhou, Wei He, Chuong Nguyen, Jingdong Wang, Hongdong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: digital human creation, Human-Object Interaction, expressive digital human, realistic motion remains, human creation

Comments:

Abstract:Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.

145. 【2603.14684】E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction

Link: https://arxiv.org/abs/2603.14684

Authors: Yunsoo Kim, Changki Sung, Dasol Hong, Hyun Myung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural radiance fields, radiance fields, view synthesis, emergence of neural, neural radiance

Comments: 10 pages, 6 figures, accepted to CVPR 2026

Abstract:The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods either assume known poses or rely on depth estimation models that are bounded by their initial observations, failing to generalize as the camera traverses previously unseen regions. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.

146. 【2603.14667】Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models

Link: https://arxiv.org/abs/2603.14667

Authors: Hendrik Chiche, Ludovic Corcos, Logan Rouge

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Magnetic resonance imaging, expensive high-field scanners, computationally enhance low-resolution, enhance low-resolution acquisitions, approximate high-resolution quality

Comments:

Abstract:Magnetic resonance imaging (MRI) super-resolution (SR) methods that computationally enhance low-resolution acquisitions to approximate high-resolution quality offer a compelling alternative to expensive high-field scanners. In this work we investigate an elucidated diffusion model (EDM) framework for brain MRI SR and compare two U-Net backbone architectures: (i) a full 3D convolutional U-Net that processes volumetric patches with 3D convolutions and multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves each slice independently while conditioning on an adjacent slice for inter-slice context. Both models employ continuous-sigma noise conditioning following Karras et al. and are trained on the NKI cohort of the FOMO60K dataset. On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model achieves 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, improving on the off-the-shelf pretrained EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D variant (35.82 dB) across all three metrics under the same test data and degradation pipeline.
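
For readers unfamiliar with the PSNR figures quoted above, the standard definition is 10·log10(MAX² / MSE). The sketch below is a generic implementation of that formula on flat pixel lists; the function name `psnr`, the `data_range` parameter, and the toy values are illustrative assumptions and say nothing about how the paper's evaluation pipeline normalizes intensities.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy example: two of four pixels are off by 0.1 on a [0, 1] intensity scale.
reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(reference, estimate), 2))  # 23.01
```

Because the scale is logarithmic, the reported gap of roughly 2 dB between the 3D model and the baselines corresponds to a reduction in mean squared error of about 37 to 40 percent.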

147. 【2603.14666】EviATTA: Evidential Active Test-Time Adaptation for Medical Segment Anything Models

Link: https://arxiv.org/abs/2603.14666

Authors: Jiayi Chen, Yasmeen George, Winston Chong, Jianfei Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Segment Anything Models, Deploying foundational medical, foundational medical Segment, Deploying foundational, active test-time adaptation

Comments: 10 pages, 8 figures, 5 tables

Abstract:Deploying foundational medical Segment Anything Models (SAMs) via test-time adaptation (TTA) is challenging under large distribution shifts, where test-time supervision is often unreliable. While active test-time adaptation (ATTA) introduces limited expert feedback to improve reliability, existing ATTA methods still suffer from unreliable uncertainty estimation and inefficient utilization of sparse annotations. To address these issues, we propose Evidential Active Test-Time Adaptation (EviATTA), which is, to our knowledge, the first ATTA framework tailored for medical SAMs. Specifically, we adopt the Dirichlet-based Evidential Modeling to decompose overall predictive uncertainty into distribution uncertainty and data uncertainty. Building on this decomposition, we design a Hierarchical Evidential Sampling strategy, where image-wise distribution uncertainty is used to select informative shifted samples, while distance-aware data uncertainty guides sparse pixel annotations to resolve data ambiguities. We further introduce Dual Consistency Regularization, which enforces progressive prompt consistency on sparsely labeled samples to better exploit sparse supervision and applies variational feature consistency on unlabeled samples to stabilize adaptation. Extensive experiments on six medical image segmentation datasets demonstrate that EviATTA consistently improves adaptation reliability with minimal expert feedback under both batch-wise and instance-wise test-time adaptation settings.

148. 【2603.14659】VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

Link: https://arxiv.org/abs/2603.14659

Authors: Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: locate and track, track question-relevant evidence, visual, Video reasoning requires, visual prompting

Comments: Project website: [this https URL](https://visioncoach.github.io/)

Abstract:Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.

149. 【2603.14658】Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos

Link: https://arxiv.org/abs/2603.14658

Authors: Marco Postiglione, Isabel Gortner, V.S. Subrahmanian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: machine learning problem, remains poorly understood, realistic conditions remains, conditions remains poorly, learning problem

Comments:

Abstract:Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.

150. 【2603.14647】TopoCL: Topological Contrastive Learning for Medical Imaging

Link: https://arxiv.org/abs/2603.14647

Authors: Guangyu Meng, Pengfei Gu, Peixian Liang, John P. Lalor, Erin Wolf Chambers, Danny Z. Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: powerful approach, Contrastive learning, Hierarchical Topology Encoder, unlabeled images, medical image

Comments:

Abstract:Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.

151. 【2603.14645】Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

Link: https://arxiv.org/abs/2603.14645

Authors: Mang Ning, Mingxiao Li, Le Zhang, Lanmiao Liu, Matthew B. Blaschko, Albert Ali Salah, Itir Onal Ertugrul

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Spectrum Matching, Spectrum Matching Hypothesis, Encoding Spectrum Matching, Decoding Spectrum Matching, variational autoencoders

Comments: We use the NIPS template for readability reasons

Abstract:In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at this https URL.

152. 【2603.14639】Seeing Where to Deploy: Metric RGB-Based Traversability Analysis for Aerial-to-Ground Hidden Space Inspection

Link: https://arxiv.org/abs/2603.14639

Authors: Seoyoung Lee, Shaekh Mohammad Shithil, Durgakant Pushp, Lantao Liu, Zhangyang Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires accessing hidden, accessing hidden spaces, elevated viewpoints, confined infrastructure, entrances are reachable

Comments:

Abstract:Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial-ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB-based geometric-semantic reconstruction and traversability analysis framework for aerial-to-ground hidden space inspection. A feed-forward multi-view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment-relevant measurements without LiDAR-based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence-aware geometric-semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV-UGV platform demonstrate reliable deployment-zone identification in hidden space scenarios.
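
One simple way to picture "recovering metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion" is a least-squares fit of a single scale factor between the two sets of per-step translation magnitudes. The sketch below is only an illustration of that idea under stated assumptions: the abstract does not specify the estimator, and the function names and toy trajectories are hypothetical.

```python
import math


def translation_norm(delta):
    """Euclidean length of one per-step translation vector."""
    return math.sqrt(sum(c * c for c in delta))


def estimate_metric_scale(predicted_steps, egomotion_steps):
    """Least-squares scale s minimizing sum_i (s * |p_i| - |g_i|)^2.

    predicted_steps: up-to-scale camera translations from the reconstruction.
    egomotion_steps: metric translations reported by the platform (assumed available).
    """
    p = [translation_norm(d) for d in predicted_steps]
    g = [translation_norm(d) for d in egomotion_steps]
    denom = sum(pi * pi for pi in p)
    if denom == 0:
        raise ValueError("predicted motion is zero; scale is unobservable")
    return sum(pi * gi for pi, gi in zip(p, g)) / denom


# Toy trajectories: the reconstruction is 2.5x too small in every step.
predicted = [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0), (0.0, 0.0, 0.1)]
measured = [(0.25, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.25)]
scale = estimate_metric_scale(predicted, measured)
print(round(scale, 3))  # 2.5
```

Multiplying the reconstructed geometry by the estimated scale would place it in metric units, which is what makes clearance and slope measurements for deployment-zone selection meaningful.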

153. 【2603.14632】Continual Few-shot Adaptation for Synthetic Fingerprint Detection

Link: https://arxiv.org/abs/2603.14632

Authors: Joseph Geo Benjamin, Anil K. Jain, Karthik Nandakumar

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Keywords: generative artificial intelligence, past decade fueled, artificial intelligence, synthetically generated fingerprint, quality and realism

Comments: Accepted in 14th International Workshop on Biometrics and Forensics (IWBF-2026)

Abstract:The quality and realism of synthetically generated fingerprint images have increased significantly over the past decade fueled by advancements in generative artificial intelligence (GenAI). This has exacerbated the vulnerability of fingerprint recognition systems to data injection attacks, where synthetic fingerprints are maliciously inserted during enrollment or authentication. Hence, there is an urgent need for methods to detect if a fingerprint image is real or synthetic. While it is straightforward to train deep neural network (DNN) models to classify images as real or synthetic, often such DNN models overfit the training data and fail to generalize well when applied to synthetic fingerprints generated using unseen GenAI models. In this work, we formulate synthetic fingerprint detection as a continual few-shot adaptation problem, where the objective is to rapidly evolve a base detector to identify new types of synthetic data. To enable continual few-shot adaptation, we employ a combination of binary cross-entropy and supervised contrastive (applied to the feature representation) losses and replay a few samples from previously known styles during fine-tuning to mitigate catastrophic forgetting. Experiments based on several DNN backbones (as feature extractors) and a variety of real and synthetic fingerprint datasets indicate that the proposed approach achieves a good trade-off between fast adaptation for detecting unseen synthetic styles and forgetting of known styles.

154. 【2603.14621】A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans

Link: https://arxiv.org/abs/2603.14621

Authors: Aadit Nilay, Bhavesh Thapar, Anant Agrawal, Mohammad Nayeem Teli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: RT-PCR tests suffer, pandemic exposed critical, high false-negative rates, expert radiological interpretation, exposed critical limitations

Comments:

Abstract:The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.
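
The fusion step in the abstract (score-weighted probability averaging followed by per-source threshold optimization) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the use of validation scores as weights, the binary F1 criterion (the paper reports macro F1), and all numbers are assumptions.

```python
def fuse_probabilities(probabilities, scores):
    """Average per-model positive-class probabilities, weighted by each model's score."""
    total = sum(scores)
    return sum(p * s for p, s in zip(probabilities, scores)) / total


def f1_score(labels, predictions):
    """Binary F1 for the positive class (a simplification of macro F1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(fused, labels, candidates):
    """Pick the candidate threshold with the highest F1 on one source's validation scans."""
    return max(candidates, key=lambda t: f1_score(labels, [int(p >= t) for p in fused]))


# Three hypothetical models scoring one scan, weighted by made-up validation F1 values.
fused = fuse_probabilities([0.9, 0.6, 0.3], [0.90, 0.85, 0.80])

# Threshold search for one hypothetical hospital centre.
centre_fused = [0.2, 0.4, 0.6, 0.8]
centre_labels = [0, 1, 1, 1]
threshold = best_threshold(centre_fused, centre_labels, [0.3, 0.5, 0.7])
print(round(fused, 4), threshold)  # 0.6118 0.3
```

Running the threshold search separately for each centre lets the decision boundary absorb site-specific shifts in the fused score distribution, which is one plausible reading of the "source-aware calibration" the abstract refers to.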

155. 【2603.14610】Make it SING: Analyzing Semantic Invariants in Classifiers

Link: https://arxiv.org/abs/2603.14610

Authors: Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: partially rooted, possess invariants, semantic, linear mappings, Null-space Geometry

Comments:

点击查看摘要

Abstract:All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
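
The null-space invariance the abstract starts from can be shown on a toy linear map. This is only a sketch of the underlying linear-algebra fact, not of SING itself (which works on deep classifiers and vision-language models). The matrix `W` and the vectors are hypothetical.

```python
def matvec(W, x):
    """Apply a linear map W (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical 2x3 linear layer.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0]]
# n lies in the null space of W: both rows have zero dot product with it.
n = [2.0, -1.0, 1.0]

x = [0.5, 1.5, -2.0]
# Moving along the null-space direction changes the input ...
x_equiv = [xi + 3.0 * ni for xi, ni in zip(x, n)]
# ... but leaves the output of the linear map unchanged.
print(matvec(W, x), matvec(W, x_equiv))
```

Inputs such as `x` and `x_equiv` form the "equivalent sets" the abstract refers to. The paper's contribution is assigning semantic descriptions to such variations.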

156. 【2603.14609】GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data

Link: https://arxiv.org/abs/2603.14609

Authors: Roger Ferrod, Maël Lecene, Krishna Sapkota, George Leifman, Vered Silverman, Genady Beryozkin, Sylvain Lobry

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation, translating raw aerial, raw aerial imagery, Precise spatial understanding, Multimodal Large Language

Abstract:Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively bridges this gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.

157. 【2603.14604】Tactile Modality Fusion for Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.14604

Authors: Charlotte Morissette, Amin Abyaneh, Wei-Di Chang, Anas Houssaini, David Meger, Hsiu-Chin Lin, Jonathan Tremblay, Gregory Dudek

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: VLA models, VLA, integrates visual-tactile signals, models, lightweight modality-fusion approach

Comments: 19 pages, 5 figures

Abstract:We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
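
Feature-wise linear modulation scales and shifts each feature channel with parameters predicted from a conditioning signal: `gamma(c) * x + beta(c)`. The sketch below illustrates that operation with tactile features as the conditioning input. It is not the TacFiLM implementation: the affine predictors, the shapes, and all numbers are hypothetical.

```python
def linear(weights, bias, v):
    """Affine map; weights holds one row per output channel."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def film(visual, tactile, gamma_w, gamma_b, beta_w, beta_b):
    """Scale and shift each visual channel using tactile-derived parameters."""
    gamma = linear(gamma_w, gamma_b, tactile)
    beta = linear(beta_w, beta_b, tactile)
    return [g * f + b for g, f, b in zip(gamma, visual, beta)]


visual = [1.0, -2.0]          # two visual feature channels (hypothetical)
tactile = [0.5, 0.0, 1.0]     # tactile embedding (hypothetical)

# With zero weights, gamma = 1 and beta = 0: the layer is the identity.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
identity_out = film(visual, tactile, zeros, [1.0, 1.0], zeros, [0.0, 0.0])

# Non-trivial modulation driven by the tactile embedding.
gamma_w = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
beta_w = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
modulated = film(visual, tactile, gamma_w, [0.0, 0.0], beta_w, [0.0, 0.5])
print(identity_out, modulated)
```

Because the conditioning enters only through per-channel scale and shift, no tactile tokens are appended to the sequence, which is consistent with the lightweight fusion the abstract argues for.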

158. 【2603.14587】Texel Splatting: Perspective-Stable 3D Pixel Art

Link: https://arxiv.org/abs/2603.14587

Authors: Dylan Ebert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pixel art requires, art requires, requires that discrete, camera moves, Rendering

Comments: 3 pages, 2 figures

Abstract:Rendering 3D scenes as pixel art requires that discrete pixels remain stable as the camera moves. Existing methods snap the camera to a grid. Under orthographic projection, this works: every pixel shifts by the same amount, and a single snap corrects all of them. Perspective breaks this. Pixels at different depths drift at different rates, and no single snap corrects all depths. Texel splatting avoids this entirely. Scene geometry is rendered into a cubemap from a fixed point in the world, and each texel is splatted to the screen as a world-space quad. Cubemap indexing gives rotation invariance. Grid-snapping the origin gives translation invariance. The primary limitation is that a fixed origin cannot see all geometry; disocclusion at probe boundaries remains an open tradeoff.
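
The grid-snapping step mentioned for the cubemap origin can be illustrated as below. This sketch assumes the origin is the camera position rounded to a regular world-space grid. The function name and the cell size are hypothetical, and the cubemap rendering and splatting stages are not shown.

```python
import math


def snap_origin(camera_pos, cell_size):
    """Snap a world-space position to the nearest point of a regular grid."""
    return tuple(math.floor(c / cell_size + 0.5) * cell_size
                 for c in camera_pos)


cell = 0.5  # hypothetical grid spacing in world units

# Small camera motions inside one cell keep the same snapped origin,
# so texels captured from that origin do not drift on screen.
origin_a = snap_origin((1.1, 0.0, 2.2), cell)
origin_b = snap_origin((1.2, 0.1, 2.24), cell)
# A larger move crosses a cell boundary and the origin jumps by one cell.
origin_c = snap_origin((1.3, 0.0, 2.2), cell)
print(origin_a, origin_b, origin_c)
```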

159. 【2603.14579】Medical Image Spatial Grounding with Semantic Sampling

Link: https://arxiv.org/abs/2603.14579

Authors: Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: shown significant promise, Vision language models, shown significant, significant promise

Comments: 10 pages, 2 figures, under review at MICCAI 2026

Abstract:Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at this https URL. In addition, we present MIS-SemSam, a low-cost, inference-time, and model-agnostic optimization of VLMs that improves their spatial grounding ability with the use of Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.

160. 【2603.14577】Covariance-Guided Resource Adaptive Learning for Efficient Edge Inference

Link: https://arxiv.org/abs/2603.14577

Authors: Ahmad N. L. Nabhaan, Zaki Sukma, Rakandhiya D. Rachmanto, Muhammad Husni Santriaji, Byungjin Cho, Arief Setyanto, In Kee Kim

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning inference, deep learning, learning inference, inference on edge, operators often struggle

Comments: 8 pages, 10 figures

Abstract:For deep learning inference on edge devices, hardware configurations achieving the same throughput can differ by 2× in power consumption, yet operators often struggle to find the efficient ones without exhaustive profiling. Existing approaches often rely on inefficient static presets or require expensive offline profiling that must be repeated for each new model or device. To address this problem, we present CORAL, an online optimization method that discovers near-optimal configurations without offline profiling. CORAL leverages distance covariance to statistically capture the non-linear dependencies between hardware settings, e.g., DVFS and concurrency levels, and performance metrics. Unlike prior work, we explicitly formulate the challenge as a throughput-power co-optimization problem to satisfy power budgets and throughput targets simultaneously. We evaluate CORAL on two NVIDIA Jetson devices across three object detection models ranging from lightweight to heavyweight. In single-target scenarios, CORAL achieves 96% to 100% of the optimal performance found by exhaustive search. In strict dual-constraint scenarios where baselines fail or exceed power budgets, CORAL consistently finds proper configurations online with minimal exploration.
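
Distance covariance detects non-linear dependence that ordinary covariance misses. The sketch below implements the standard sample statistic (double-centred pairwise distances) for one-dimensional data. How CORAL uses it inside its optimizer is not stated in the abstract, so the sample values and variable names here are purely hypothetical.

```python
import math


def centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]          # equals column means (symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_covariance_sq(x, y):
    """Squared sample distance covariance of two equal-length samples."""
    n = len(x)
    a, b = centered_distances(x), centered_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y):
    """Distance correlation in [0, 1]; 0.0 if either sample is constant."""
    dxy = distance_covariance_sq(x, y)
    dxx = distance_covariance_sq(x, x)
    dyy = distance_covariance_sq(y, y)
    if dxx * dyy == 0:
        return 0.0
    return math.sqrt(max(dxy, 0.0) / math.sqrt(dxx * dyy))


# Hypothetical samples: a normalised hardware setting and a metric that
# depends on it quadratically, so the linear covariance is zero.
setting = [-2.0, -1.0, 0.0, 1.0, 2.0]
metric = [s * s for s in setting]

mean_s = sum(setting) / len(setting)
mean_m = sum(metric) / len(metric)
linear_cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(setting, metric))
dcor = distance_correlation(setting, metric)
print(linear_cov, round(dcor, 3))
```

The linear covariance vanishes for this symmetric quadratic relationship while the distance correlation stays clearly above zero, which is the property that makes the statistic useful for ranking settings with non-linear effects.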

161. 【2603.14559】A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy

Link: https://arxiv.org/abs/2603.14559

Authors: Noha Ghatwary, Jiangbei Yue, Ahmed Elgendy, Hanna Nagdy, Ahmed Galal, Hayam Fathy, Hussein El-Amin, Venkataraman Subramanian, Noor Mohammed, Gilberto Ochoa-Ruiz, Sharib Ali

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Ulcerative Colitis Endoscopic, Ulcerative colitis, chronic mucosal inflammatory, mucosal inflammatory condition, Colitis Endoscopic Index

Comments: 11

Abstract:Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms. In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.

162. 【2603.14549】ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

Link: https://arxiv.org/abs/2603.14549

Authors: Surendra Pathak, Bo Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, demonstrate exceptional multi-modal, exceptional multi-modal capabilities, Vision-Language Models, Large Vision-Language

Abstract:While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.

163. 【2603.14536】Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders

Link: https://arxiv.org/abs/2603.14536

Authors: Jiaming Chu, Tao Wang, Lei Jin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Variational Autoencoder, obtain compact alternatives, modern generative models, VAE encoder distillation, play a critical

Abstract:Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that a model works better on samples close to its training data distribution than on samples from an unseen distribution. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ resolution inputs, partially inheriting the teacher model's resolution this http URL further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. This also means that high training costs in memory, time, and high-resolution data are not necessary for distilling a VAE with high-resolution image reconstruction capabilities. Even on low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.

164. 【2603.14528】Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events

Link: https://arxiv.org/abs/2603.14528

Authors: Shuang Guo, Filbert Febryanto, Lei Sun, Guillermo Gallego

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: achieving impressive accuracy, visual foundation models, foundation models pioneered, recent years, visual foundation

Comments: 18 pages, 6 figures, 5 tables

Abstract:In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.

165. 【2603.14526】LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

Link: https://arxiv.org/abs/2603.14526

Authors: Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, Ioannis Patras

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inspired similar explorations, large language models, recent success, large language, inspired similar

Comments: Project page: see [this https URL](https://zengqunzhao.github.io/LatSearch)

Abstract:The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
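
The two stages of Reward-Guided Resampling and Pruning can be sketched as below. This is an illustration under assumptions, not the LatSearch implementation: the abstract does not say how rewards are normalised, so a temperature softmax is assumed here. The candidate names, rewards, and temperature are hypothetical stand-ins.

```python
import math
import random


def resample(candidates, rewards, k, temperature=1.0, rng=random):
    """Draw k candidates with probability proportional to softmax(reward / T).

    Sampling, rather than always taking the top scorer, keeps lower-reward
    candidates alive and so reduces reliance on the reward model.
    """
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    return rng.choices(candidates, weights=weights, k=k)


def prune(candidates, cumulative_rewards):
    """Keep only the candidate with the highest cumulative reward."""
    best = max(range(len(candidates)), key=lambda i: cumulative_rewards[i])
    return candidates[best]


rng = random.Random(0)
latents = ["z0", "z1", "z2", "z3"]      # stand-ins for partially denoised latents
step_rewards = [0.1, 0.9, 0.4, 0.2]     # hypothetical latent-reward scores
survivors = resample(latents, step_rewards, k=4, rng=rng)

# At the final scheduled step, a single candidate is retained.
final = prune(latents, [1.2, 3.1, 2.0, 0.7])
print(survivors, final)
```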

166. 【2603.14523】VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Link: https://arxiv.org/abs/2603.14523

Authors: Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: shown promising capabilities, existing approaches rely, embodied intelligence, rely on text-based, static context

Comments: We introduce VLA-Thinker, the first VLA model capable of thinking-with-image reasoning, which models visual perception as a dynamically invocable reasoning action, enabling Multimodal Embodied Chain-of-Thought

Abstract:Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: this https URL .

167. 【2603.14507】Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets

Link: https://arxiv.org/abs/2603.14507

Authors: Zhuoxuan Peng, Boan Zhu, Xingjian Zhang, Wenying Li, S.-H. Gary Chan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human pose estimation, Current mmWave datasets, human pose, human poses, Current mmWave

Comments: Accepted by CVPR2026

Abstract:Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.

168. 【2603.14505】Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs

Link: https://arxiv.org/abs/2603.14505

Authors: Yiren Zheng, Shibo Li, Jiaming Liu, Haofan Wang, Yiren Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Current multimodal approaches, Large Language, multimodal approaches predominantly, approaches predominantly treat

Abstract:Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.

169. 【2603.14504】Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models

Link: https://arxiv.org/abs/2603.14504

Authors: Niklas Schweiger, Daniel Cremers, Karnik Ram

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly popular approach, inference time, increasingly popular, reward models, models

Comments: Preprint (shorter version accepted at ICLR ReaLM-GEN workshop)

Abstract:Optimizing the noise samples of diffusion and flow models is an increasingly popular approach to align these models to target rewards at inference time. However, we observe that these approaches are usually restricted to differentiable or cheap reward models, the formulation of the underlying pretrained generative model, or are memory/compute inefficient. We instead propose a simple trust-region based search algorithm (TRS) which treats the pre-trained generative and reward models as a black-box and only optimizes the source noise. Our approach achieves a good balance between global exploration and local exploitation, and is versatile and easily adaptable to various generative settings and reward models with minimal hyperparameter tuning. We evaluate TRS across text-to-image, molecule and protein design tasks, and obtain significantly improved output samples over the base generative models and other inference-time alignment approaches which optimize the source noise sample, or even the entire reverse-time sampling noise trajectories in the case of diffusion models. Our source code is publicly available.
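
The general idea of a black-box trust-region search over source noise can be sketched as follows. This is a generic trust-region random search written for illustration, not the TRS algorithm from the paper, whose details the abstract does not give. The reward function stands in for "generate from noise, then score with a reward model", and every parameter value is hypothetical.

```python
import random


def trust_region_search(reward_fn, dim, iters=200, batch=8,
                        radius=1.0, grow=1.5, shrink=0.5, seed=0):
    """Maximise a black-box reward by searching around the best noise so far."""
    rng = random.Random(seed)
    center = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # initial source noise
    best = reward_fn(center)
    for _ in range(iters):
        # Propose candidates inside a box of half-width `radius`.
        proposals = [[c + rng.uniform(-radius, radius) for c in center]
                     for _ in range(batch)]
        scored = [(reward_fn(p), p) for p in proposals]
        top_reward, top = max(scored, key=lambda t: t[0])
        if top_reward > best:
            center, best = top, top_reward   # accept and explore more widely
            radius *= grow
        else:
            radius = max(radius * shrink, 1e-6)   # reject and search locally
    return center, best


# Toy stand-in for generator + reward model: prefer noise near a target.
target = [0.5, -1.0, 2.0, 0.0]


def toy_reward(z):
    return -sum((a - b) ** 2 for a, b in zip(z, target))


noise, reward = trust_region_search(toy_reward, dim=4)
print(round(reward, 4))
```

Only `reward_fn` is ever called, so no gradients through the generator or reward model are needed, which matches the black-box setting the abstract describes.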

170. 【2603.14503】Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models

Link: https://arxiv.org/abs/2603.14503

Authors: Diego Royo, Brandon Zhao, Adolfo Muñoz, Diego Gutierrez, Katherine L. Bouman

Categories: Computer Vision and Pattern Recognition (cs.CV); Cosmology and Nongalactic Astrophysics (astro-ph.CO)

Keywords: distorts background light, Galaxy clusters, dark matter, distorts background, powerful probes

Comments: 22 pages, 7 figures. Project page available at: [this https URL](https://graphics.unizar.es/projects/DarkMatterMapping)

Abstract:Galaxy clusters are powerful probes of astrophysics and cosmology through gravitational lensing: the clusters' mass, dominated by 85% dark matter, distorts background light. Yet, mass reconstruction lacks the scalability and large-scale benchmarks to process the hundreds of thousands of clusters expected from forthcoming wide-field surveys. We introduce a fully automated method to reconstruct cluster surface mass density from photometry and gravitational lensing observables. Central to our approach is DarkClusters-15k, our new dataset of 15,000 simulated clusters with paired mass and photometry maps, the largest benchmark to date, spanning multiple redshifts and simulation frameworks. We train a plug-and-play diffusion prior on DarkClusters-15k that learns the statistical relationship between mass and light, and draw posterior samples constrained by weak- and strong-lensing observables; this yields principled reconstructions driven by explicit physics, alongside well-calibrated uncertainties. Our approach requires no expert tuning, runs in minutes rather than hours, achieves higher accuracy, and matches expertly-tuned reconstructions of the MACS 1206 cluster. We release our method and DarkClusters-15k to support development and benchmarking for upcoming wide-field cosmological surveys.

171. 【2603.14498】R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Link: https://arxiv.org/abs/2603.14498

Authors: Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao, Qiaochu Yang, Yaxing Lyu, Zhixuan Liang, Yibin Liu, Congsheng Xu, Xianda Guo, Wei Sui, Yaohui Jin, Xiaokang Yang, Yanyan Xu, Yao Mu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embodied manipulation requires, execute contact-rich actions, manipulation requires accurate, Embodied manipulation, requires accurate

Abstract:Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.

172. 【2603.14497】WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Link: https://arxiv.org/abs/2603.14497

Authors: Stefan Englmeier, Katharina Winter, Fabian B. Flohr

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: driving systems depend, surrounding environment, Autonomous driving systems, systems depend, contexts and accurately

Comments: 8 pages, 6 figures, 5 tables

Abstract:Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM, a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.

173. 【2603.14496】Refining 3D Medical Segmentation with Verbal Instruction

链接https://arxiv.org/abs/2603.14496

作者:Kangxian Xie,Jiancheng Yang,Nandor Pinter,Chao Wu,Behzad Bozorgtabar,Mingchen Gao

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:surgical planning, segmentation is essential, essential for clinical, clinical diagnosis, diagnosis and surgical

备注

点击查看摘要

Abstract:Accurate 3D anatomical segmentation is essential for clinical diagnosis and surgical planning. However, automated models frequently generate suboptimal shape predictions due to factors such as limited and imbalanced training data, inadequate labeling quality, and distribution shifts between training and deployment settings. A natural solution is to iteratively refine the predicted shape based on the radiologists' verbal instructions. However, this is hindered by the scarcity of paired data that explicitly links erroneous shapes to corresponding corrective instructions. As an initial step toward addressing this limitation, we introduce CoWTalk, a benchmark comprising 3D arterial anatomies with controllable synthesized anatomical errors and their corresponding repairing instructions. Building on this benchmark, we further propose an iterative refinement model that represents 3D shapes as vector sets and interacts with textual instructions to progressively update the target shape. Experimental results demonstrate that our method achieves significant improvements over corrupted inputs and competitive baselines, highlighting the feasibility of language-driven clinician-in-the-loop refinement for 3D medical shape modeling.

174. 【2603.14493】Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

链接https://arxiv.org/abs/2603.14493

作者:He Li,Yuhui Zhang,Xiaohan Wang,Kaifeng Lyu,Serena Yeung-Levy

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:multimodal large language, large language models, mitigate catastrophic forgetting, simple adjustments, fine-tuning recipes

备注

点击查看摘要

Abstract:This paper demonstrates that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.

175. 【2603.14482】V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

链接https://arxiv.org/abs/2603.14482

作者:Lorenzo Mur-Labadia,Matthew Muckley,Amir Bar,Mido Assran,Koustuv Sinha,Mike Rabbat,Yann LeCun,Nicolas Ballas,Adrien Bardes

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-quality visual representations, present V-JEPA, strong global scene, retaining strong global, global scene understanding

备注

点击查看摘要

Abstract:We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

176. 【2603.14475】Wi-Spike: A Low-power WiFi Human Multi-action Recognition Model with Spiking Neural Networks

链接https://arxiv.org/abs/2603.14475

作者:Nengbo Zhang,Yao Ying,Lu Wang,Kaishun Wu,Jieming Ma,Fei Luo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained significant attention, significant attention due, privacy-preserving nature, gained significant, non-intrusive and privacy-preserving

备注

点击查看摘要

Abstract:WiFi-based human action recognition (HAR) has gained significant attention due to its non-intrusive and privacy-preserving nature. However, most existing WiFi sensing models predominantly focus on improving recognition accuracy, while issues of power consumption and energy efficiency remain insufficiently discussed. In this work, we present Wi-Spike, a bio-inspired spiking neural network (SNN) framework for efficient and accurate action recognition using WiFi channel state information (CSI) signals. Specifically, leveraging the event-driven and low-power characteristics of SNNs, Wi-Spike introduces spiking convolutional layers for spatio-temporal feature extraction and a novel temporal attention mechanism to enhance discriminative representation. The extracted features are subsequently encoded and classified through spiking fully connected layers and a voting layer. Comprehensive experiments on three benchmark datasets (NTU-Fi-HAR, NTU-Fi-HumanID, and UT-HAR) demonstrate that Wi-Spike achieves competitive accuracy in single-action recognition and superior performance in multi-action recognition tasks. As for energy consumption, Wi-Spike reduces the energy cost by at least half compared with other methods, while still achieving 95.83% recognition accuracy in human activity recognition. More importantly, Wi-Spike establishes a new state-of-the-art in WiFi-based multi-action HAR, offering a promising solution for real-time, energy-efficient edge sensing applications.

177. 【2603.14468】LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

链接https://arxiv.org/abs/2603.14468

作者:Rongyi Yu,Chenyuan Duan,Wentao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:increasingly relies, retrieval, long videos, video question answering, retrieval planning

备注: 12 pages, 2 figures, appendix included

点击查看摘要

Abstract:Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet it remains below 50%, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
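Two ideas from the abstract, the retrieval-necessity condition and three-judge majority voting, can be stated compactly in code. This is a sketch under assumptions: the `solvable` oracle, the clip identifiers, and both function names are hypothetical and are not taken from the benchmark's tooling.

```python
from typing import Callable, FrozenSet, Sequence


def satisfies_retrieval_necessity(
    clips: Sequence[str],
    solvable: Callable[[FrozenSet[str]], bool],
) -> bool:
    """Check that all clips together suffice and that each one is required."""
    full = frozenset(clips)
    if not solvable(full):
        return False
    # Removing any single clip must make the question unsolvable.
    return all(not solvable(full - {clip}) for clip in clips)


def majority_accuracy(judgments: Sequence[Sequence[bool]]) -> float:
    """Percentage of answers accepted by a strict majority of judges."""
    accepted = [sum(votes) * 2 > len(votes) for votes in judgments]
    return 100.0 * sum(accepted) / len(accepted)
```

For a Hop-k question, `clips` would hold exactly k entries; a candidate set that still passes after dropping one clip fails the check.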

178. 【2603.14460】Inclusive AI for Group Interactions: Predicting Gaze-Direction Behaviors in People with Intellectual and Developmental Disabilities

链接https://arxiv.org/abs/2603.14460

作者:Giulia Huang,Maristella Matera,Micol Spitale

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:hold great promise, support human group, Artificial agents, interactions hold great, human group interactions

备注: Accepted to IEEE FG 2026. Includes the Multi-party Interaction with Intellectual and Developmental Disabilities (MIDD) dataset

点击查看摘要

Abstract:Artificial agents that support human group interactions hold great promise, especially in sensitive contexts such as well-being promotion and therapeutic interventions. However, current systems struggle to mediate group interactions involving people who are not neurotypical. This limitation arises because most AI detection models (e.g., for turn-taking) are trained on data from neurotypical populations. This work takes a step toward inclusive AI by addressing the challenge of eye contact detection, a core component of non-verbal communication, with and for people with Intellectual and Developmental Disabilities. First, we introduce a new dataset, Multi-party Interaction with Intellectual and Developmental Disabilities (MIDD), capturing atypical gaze and engagement patterns. Second, we present the results of a comparative analysis with neurotypical datasets, highlighting differences in class imbalance, speaking activity, gaze distribution, and interaction dynamics. Then, we evaluate classifiers ranging from SVMs to FSFNet, showing that fine-tuning on MIDD improves performance, though notable limitations remain. Finally, we present the insights gathered through a focus group with six therapists to interpret our quantitative findings and understand the practical implications of atypical gaze and engagement patterns. Based on these results, we discuss data-driven strategies and emphasize the importance of feature choice for building more inclusive human-centered tools.

179. 【2603.14452】Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality

链接https://arxiv.org/abs/2603.14452

作者:Wenrui Cai,Zhenyi Lu,Yuzhe Li,Yongchao Feng,Jinqing Zhang,Qingjie Liu,Yunhong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:inter-frame relation modeling, possess strong capability, Transformer-based one-stream trackers, advent of Transformer-based, Transformer-based one-stream

备注: 15 pages, 9 figures, 16 tables

点击查看摘要

Abstract:With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the length of input and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address these issues, we propose Uni-MDTrack, which consists of two core components: a Memory-Aware Compression Prompt (MCP) module and a Dynamic State Fusion (DSF) module. MCP effectively compresses rich memory features into memory-aware prompt tokens, which deeply interact with the input throughout the entire backbone, significantly enhancing the performance while maintaining a stable computational load. DSF complements the discrete memory by capturing the continuous dynamics, progressively introducing the updated dynamic state features from shallow to deep layers, while also preserving high efficiency. Uni-MDTrack also supports unified tracking across RGB, RGB-D/T/E, and RGB-Language modalities. Experiments show that in Uni-MDTrack, training only the MCP, DSF, and prediction head, which keeps the proportion of trainable parameters around 30%, yields substantial performance gains and achieves state-of-the-art results on 10 datasets spanning five modalities. Furthermore, both MCP and DSF exhibit excellent generality, functioning as plug-and-play components that can boost the performance of various baseline trackers, while significantly outperforming existing parameter-efficient training approaches.

180. 【2603.14435】End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

链接https://arxiv.org/abs/2603.14435

作者:Haoyu Zhang,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remains challenging due, single RGB video, human-object interaction, RGB video, recovering a moving

备注: 23 pages, 7 figures. The project page is available at: [this https URL](https://nianheng.github.io/THO-project/)

点击查看摘要

Abstract:Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency that fails to meet real-time requirements and makes them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO operates at an inference speed of 31.5 FPS on a single RTX 4090 GPU, achieving a 600x speedup over prior optimization-based methods while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: this https URL

181. 【2603.14426】GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

链接https://arxiv.org/abs/2603.14426

作者:Minghan Li,Tongna Chen,Tianrui Lv,Yishuai Zhang,Suchao An,Guodong Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:leaving temporal reasoning, end-state grounding under-evaluated, single frame, temporal hard negative, dominated by real-world

备注

点击查看摘要

Abstract:Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on this http URL.
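The triplet structure described above (main video, temporal hard negative, semantic hard negative) lends itself to simple relative-order statistics. The sketch below is one possible reading, not the benchmark's released analysis code; the function name, the tuple layout, and the three reported rates are assumptions.

```python
from typing import Dict, Sequence, Tuple

# Each triplet holds retrieval scores for one query, in the assumed order
# (main video, temporal hard negative, semantic hard negative).
Triplet = Tuple[float, float, float]


def triplet_order_stats(triplets: Sequence[Triplet]) -> Dict[str, float]:
    """Fraction of queries where the main video outranks each negative."""
    n = len(triplets)
    if n == 0:
        raise ValueError("at least one triplet is required")
    beats_temporal = sum(main > temp for main, temp, _ in triplets)
    beats_semantic = sum(main > sem for main, _, sem in triplets)
    beats_both = sum(main > temp and main > sem for main, temp, sem in triplets)
    return {
        "main_over_temporal": beats_temporal / n,
        "main_over_semantic": beats_semantic / n,
        "main_over_both": beats_both / n,
    }
```

A large gap between `main_over_semantic` and `main_over_temporal` would correspond to the failure pattern the abstract reports, where models confuse the main video with the temporal hard negative more often than with the semantic one.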

182. 【2603.14418】Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

链接https://arxiv.org/abs/2603.14418

作者:Wen Yan,Yipei Wang,Shiqi Huang,Natasha Thorley,Mark Emberton,Vasilis Stavrinides,Yipeng Hu,Dean Barratt

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:prostate lesion segmentation, Label variability, major challenge, challenge for prostate, Label

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites at inference. We treat each observed annotation as a noisy observation of an underlying latent 'clean' lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN using this posterior as a soft target while estimating site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the likelihood term contributed only by site deviations. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. For pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; for leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p < 0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity alpha ranges from 31.5% to 47.3% at a specificity beta of approximately 0.99), supporting post-hoc analyses of cross-site annotation variability. These results indicate that explicitly modelling site-dependent annotation can improve cross-site generalisation.
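Step (1), the voxel-wise posterior over the latent mask, can be illustrated with the standard Bayes update for a binary label observed through a noisy annotator. This is a sketch that assumes the E-step takes this textbook form; the function name and arguments are hypothetical, and the hierarchical prior over site- and case-level deviations is not reproduced here.

```python
def latent_mask_posterior(prior: float, observed: int, alpha: float, beta: float) -> float:
    """Posterior probability that a voxel is truly lesion.

    prior    -- model probability that the voxel is lesion (e.g. CNN output)
    observed -- the annotator's label for this voxel, 0 or 1
    alpha    -- annotator sensitivity, P(observed = 1 | true = 1)
    beta     -- annotator specificity, P(observed = 0 | true = 0)
    """
    if observed == 1:
        lesion = prior * alpha
        background = (1.0 - prior) * (1.0 - beta)
    else:
        lesion = prior * (1.0 - alpha)
        background = (1.0 - prior) * beta
    total = lesion + background
    # Fall back to the prior if both likelihood terms vanish.
    return lesion / total if total > 0.0 else prior
```

With low sensitivity and high specificity, as in the reported estimates, a positive annotation is strong evidence of lesion, while a negative annotation only mildly lowers the prior.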

183. 【2603.14416】Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology

链接https://arxiv.org/abs/2603.14416

作者:Enam Ahmed Taufik,Md Ahasanul Arafath,Abhijit Kumar Ghosh,Md. Tanzim Reza,Md Ashad Alam

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:breast cancer diagnosis, reliable histopathological image, histopathological image classification, Accurate and reliable, cancer diagnosis

备注: 34 pages, 6 figures

点击查看摘要

Abstract:Accurate and reliable histopathological image classification is essential for breast cancer diagnosis. However, many deep learning models remain sensitive to magnification variability and lack interpretability. To address these challenges, we propose Histo-MExNet, a unified framework designed for scale-invariant and uncertainty-aware classification. The model integrates DenseNet, ConvNeXt, and EfficientNet backbones within a gated multi-expert architecture, incorporates a prototype learning module for example-driven interpretability, and applies physics-informed regularization to enforce morphology preservation and spatial coherence during feature learning. Monte Carlo Dropout is used to quantify predictive uncertainty. On the BreaKHis dataset, Histo-MExNet achieves 96.97% accuracy under multi-magnification training and demonstrates improved generalization to unseen magnification levels compared to single-expert models, while uncertainty estimation helps identify out-of-distribution samples and reduce overconfident errors, supporting a balanced combination of accuracy, robustness, and interpretability for clinical decision support.
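Monte Carlo Dropout, which the abstract uses to quantify predictive uncertainty, keeps dropout active at inference and averages several stochastic forward passes. The sketch below shows the idea on a toy two-class linear model; the weights, the dropout rate, and the function names are assumptions for illustration and have no connection to the Histo-MExNet architecture.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def toy_forward(x: Sequence[float], rng: random.Random, drop: float = 0.5) -> List[float]:
    """One stochastic pass of a toy 3-feature, 2-class model with input dropout."""
    weights = [[1.0, -1.0], [-0.5, 0.5], [0.8, 0.2]]  # hypothetical weights
    # Inverted dropout: zero a feature with probability `drop`, rescale the rest.
    kept = [xi / (1.0 - drop) if rng.random() >= drop else 0.0 for xi in x]
    logits = [sum(k * w[c] for k, w in zip(kept, weights)) for c in range(2)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_predict(
    forward: Callable[[Sequence[float], random.Random], List[float]],
    x: Sequence[float],
    passes: int = 20,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Average class probabilities over stochastic passes and report entropy."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(passes)]
    mean = [sum(col) / passes for col in zip(*samples)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy
```

Higher entropy of the averaged prediction signals lower confidence, which is the kind of signal that could be thresholded to flag possible out-of-distribution samples.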

184. 【2603.14412】G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening

链接https://arxiv.org/abs/2603.14412

作者:Zhiqi Yang,Shan Yin,Jingze Liang,Liang-Jian Deng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:high-resolution multispectral

备注

点击查看摘要

Abstract:Pansharpening aims to fuse a high-resolution panchromatic (PAN) image and a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Recent deep models have achieved strong performance, yet they typically rely on large-scale pretraining and often generalize poorly to unseen real-world image this http URL zero-shot approaches improve real-scene generalization but require per-image optimization, hindering weight reuse, and the above methods are usually limited to a fixed this http URL address this issue, we propose G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening, designed to handle cross-resolution, cross-scene, and cross-sensor generalization. G-ZAP adopts a feature-based implicit neural representation (INR) fusion network as the backbone and introduces a multi-scale, semi-supervised training scheme to enable robust this http URL experiments on multiple real-world datasets show that G-ZAP achieves state-of-the-art results under PAN-scale fusion in both visual quality and quantitative this http URL, G-ZAP supports weight reuse across image pairs while maintaining competitiveness with per-pair retraining, demonstrating strong potential for efficient real-world deployment.

185. 【2603.14409】PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis

链接https://arxiv.org/abs/2603.14409

作者:Mritula Chandrasekaran,Sanket Kachole,Jarek Francik,Dimitrios Makris

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Pathological Gait-conditioned Generative, variable clinical datasets, diverse gait impairments, Generative Adversarial Network, Gait-conditioned Generative Adversarial

备注

点击查看摘要

Abstract:Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectory data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.
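The one-hot conditioning mentioned in the abstract can be sketched as follows. The abstract does not say how the label is injected, so appending the code to every frame's keypoint vector is an assumption, as are the function names; only the number of categories (six) comes from the abstract.

```python
from typing import List, Sequence


def one_hot(label: int, num_classes: int = 6) -> List[float]:
    """Encode a pathology class index as a one-hot vector."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def condition_sequence(
    sequence: Sequence[Sequence[float]],
    label: int,
    num_classes: int = 6,
) -> List[List[float]]:
    """Append the one-hot pathology code to every frame of a gait sequence."""
    code = one_hot(label, num_classes)
    return [list(frame) + code for frame in sequence]
```

The same conditioned representation could be fed to both the generator and the discriminator, so that each sees the pathology class alongside the keypoint data.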

186. 【2603.14401】OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

链接https://arxiv.org/abs/2603.14401

作者:Kuanning Wang,Ke Fan,Yuqian Fu,Siyu Lin,Hu Luo,Daniel Seita,Yanwei Fu,Yu-Gang Jiang,Xiangyang Xue

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:framework for video-based, transfer that learns, learns directly, enable robust manipulation, human demonstration videos

备注: Project page: [this https URL](https://sressers.github.io/OCRA/)

点击查看摘要

Abstract:We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.

187. 【2603.14382】StAR: Segment Anything Reasoner

链接https://arxiv.org/abs/2603.14382

作者:Seokju Yun,Dongheon Lee,Noori Bae,Jaesung Jun,Chanseul Cho,Youngmin Ro

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complex real-world environments, perform holistic reasoning, real-world environments, increasingly important, integrated more rapidly

备注: Code: [this https URL](https://github.com/ysj9909/StAR)

点击查看摘要

Abstract:As AI systems are being integrated more rapidly into diverse and complex real-world environments, the ability to perform holistic reasoning over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base model. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives, including parameter-tuning scheme, reward functions, learning strategies, and answer format, and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmarks, we also construct ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.

188. 【2603.14377】LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction

链接https://arxiv.org/abs/2603.14377

作者:Qianyu Zhang,Bolun Zheng,Lingyu Zhu,Aiai Huang,Zongpeng Li,Shiqi Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Prevailing High Dynamic, High Dynamic Range, Prevailing High, Dynamic Range, High Dynamic

备注

点击查看摘要

Abstract:Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.

189. 【2603.14375】The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics

链接https://arxiv.org/abs/2603.14375

作者:Xiangbo Gao,Mingyang Wu,Siyuan Yang,Jiongze Yu,Pardis Taghavi,Fangzhou Lin,Zhengzhong Tu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:simulation requires mastering, remarkable visual realism, achieved remarkable visual, physical simulation requires, recent generative video

备注

点击查看摘要

Abstract:While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is this https URL.
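One plausible reading of "controlled temporal resampling" is that training pairs are built by subsampling a clip whose true capture rate is known, so the physical frame rate of the result is also known. The sketch below follows that reading; the function name, the integer stride, and the labelling rule are assumptions, not details confirmed by the abstract.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def resample_with_label(
    frames: Sequence[Frame],
    source_fps: float,
    stride: int,
) -> Tuple[List[Frame], float]:
    """Keep every `stride`-th frame and return the implied physical FPS.

    If the source was captured at source_fps frames per real-world second,
    the subsampled clip covers the same motion at source_fps / stride frames
    per real-world second, whatever nominal rate it is later played at.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(frames[::stride]), source_fps / stride
```

A predictor trained on such pairs sees the same scene at several known temporal scales, which is what would let it estimate a frame rate from motion alone instead of from metadata.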

190. 【2603.14367】HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

Link: https://arxiv.org/abs/2603.14367

Authors: Xiaoya Lu, Yijin Zhou, Zeren Chen, Ruocheng Wang, Bingrui Sima, Enshen Zhou, Lu Sheng, Dongrui Liu, Jing Shao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: empower embodied agents, execute complex instructions, subtle environmental states, empower embodied, complex instructions

Abstract:Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safety trajectory generation. Code and data are released under this https URL

191. 【2603.14366】Representation Alignment for Just Image Transformers is not Easier than You Think

Link: https://arxiv.org/abs/2603.14366

Authors: Jaeyo Shin, Jiwook Kim, Hyunjung Shim

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: accelerate Diffusion Transformers, pixel-space diffusion transformers, Diffusion Transformers, Diffusion Transformers training, accelerate Diffusion

Comments: Code: [this https URL](https://github.com/kaist-cvml/PixelREPA)

Abstract:Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformer training in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $2\times$ faster convergence. Finally, PixelREPA-H/16 achieves FID $= 1.81$ and IS $= 317.2$. Our code is available at this https URL.

192. 【2603.14363】AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Link: https://arxiv.org/abs/2603.14363

Authors: Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, VLN, Vision-Language Navigation

Comments: 18 pages, 4 figures. Code and demo videos will be available at: [this https URL](https://github.com/XuPeng23/AerialVLA)

Abstract:Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.

193. 【2603.14361】BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy

Link: https://arxiv.org/abs/2603.14361

Authors: Alexandre Pereira, Bruno Fernandes, Pablo Barros

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recognizing complex behavioral, naturalistic video settings, video settings remains, Recognizing complex, complex behavioral states

Comments: 5 pages, 2 figures, 3 tables, Ambivalence/Hesitancy (AH) Video Recognition Challenge, ABAW10th, CVPR2026

Abstract:Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the most well-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set. These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.
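The abstract states that the PSO fitness includes a train-validation gap penalty weighted by lambda, but it does not give the formula. The sketch below is one plausible reading, offered as an assumption: fitness is the validation Macro F1 of a weighted hard vote, minus lambda times the absolute train-validation gap. The helper names (`vote`, `macro_f1`, `fitness`) and the toy data are hypothetical, and the PSO search loop itself is omitted.

```python
def vote(weights, predictions, threshold=0.5):
    """Weighted hard vote; predictions[m][i] is classifier m's 0/1 label for sample i."""
    total = sum(weights)
    labels = []
    for i in range(len(predictions[0])):
        score = sum(w * preds[i] for w, preds in zip(weights, predictions))
        labels.append(1 if total > 0 and score / total >= threshold else 0)
    return labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for the two classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def fitness(weights, train_preds, y_train, val_preds, y_val, lam=0.2):
    """Assumed form: validation F1 minus lam * |train F1 - validation F1|."""
    f1_train = macro_f1(y_train, vote(weights, train_preds))
    f1_val = macro_f1(y_val, vote(weights, val_preds))
    return f1_val - lam * abs(f1_train - f1_val)


# Toy data: classifier 0 is accurate, classifier 1 always predicts one class.
y_train, y_val = [1, 0, 1, 0], [1, 0]
train_preds = [[1, 0, 1, 0], [1, 1, 1, 1]]
val_preds = [[1, 0], [0, 0]]
good = fitness([1.0, 0.0], train_preds, y_train, val_preds, y_val)
bad = fitness([0.0, 1.0], train_preds, y_train, val_preds, y_val)
print(good, bad)
```

Under this form, a particle (weight vector) that scores well on training data but poorly on validation data is penalized twice: once through the lower validation F1 and again through the gap term.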

194. 【2603.14342】AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.14342

Authors: Jiarui Zhang, Junqi Hu, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-modal Large Language, top-down UAV, UAV and satellite, Existing Multi-modal Large, Large Language Models

Abstract:Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce the first large-scale AgroOmni (288K), a multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.

195. 【2603.14337】On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs

Link: https://arxiv.org/abs/2603.14337

Authors: Suho Yoo, Youngjoon Jang, Joon Son Chung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, remain partially understood, achieved remarkable success, Large language, behaviour remain partially

Comments: Preprint

Abstract:Large language models and their multimodal extensions have achieved remarkable success across diverse tasks, yet the internal mechanisms that govern their reasoning behaviour remain partially understood. In particular, the attention sink, a token that attracts disproportionate attention mass, has been observed in transformer architectures, but its role is still unclear. Our goal is to understand what attention sinks represent and how they shape model behaviour during inference, rather than considering them as incidental artifacts. Through our analysis, we find that attention sink representations encode structured global information that influences the decoding process. Building on our findings, we introduce OutRo, a lightweight inference-time strategy that leverages the sink token to enhance contextual representations: (i) non-sink token representations are aligned with the sink representation in the feature space; and (ii) the sink token is allowed to attend beyond the causal constraint, facilitating information exchange with non-sink tokens. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance across representative MLLMs on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.

196. 【2603.14336】UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

Link: https://arxiv.org/abs/2603.14336

Authors: Yang Zhan, Yuan Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Abstract:Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly addresses this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: this https URL)

197. 【2603.14331】AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Link: https://arxiv.org/abs/2603.14331

Authors: Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: talking avatar generation, avatar generation requires, generation requires low, requires low latency, Real-time talking avatar

Abstract:Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. Our page is available at: this https URL

198. 【2603.14323】How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

Link: https://arxiv.org/abs/2603.14323

Authors: Guimeng Liu, Tianze Yu, Somayeh Ebrahimkhani, Lin Zhi Zheng Shawn, Kok Pin Ng, Ngai-Man Cheung

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generalist multimodal large, multimodal large language, Generalist multimodal, achieved impressive performance, large language models

Comments: Published as a conference paper at ICLR 2026

Abstract:Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs' under-performance. Additional experiments are included in the Supp.

199. 【2603.14321】Personalized Cell Segmentation: Benchmark and Framework for Reference-Guided Cell Type Segmentation

Link: https://arxiv.org/abs/2603.14321

Authors: Bisheng Wang, Jaime S. Cardoso, Lin Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate cell segmentation, medical imaging studies, Accurate cell, cell segmentation, Personalized Cell Segmentation

Comments: Accepted by IEEE ICASSP 2026. 5 pages, 3 figures. (C) 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising/promotional purposes, creating new collective works, for resale or redistribution, or reuse of any copyrighted component

Abstract:Accurate cell segmentation is critical for biological and medical imaging studies. Although recent deep learning models have advanced this task, most methods are limited to generic cell segmentation, lacking the ability to differentiate specific cell types. In this work, we introduce the Personalized Cell Segmentation (PerCS) task, which aims to segment all cells of a specific type given a reference cell. To support this task, we establish a benchmark by reorganizing publicly available datasets, yielding 1,372 images and over 110,000 annotated cells. As a pioneering solution, we propose PerCS-DINO, a framework built on the DINOv2 backbone. By integrating image features and reference embeddings via a cross-attention transformer and contrastive learning, PerCS-DINO effectively segments cells matching the reference. Extensive experiments demonstrate the effectiveness of the proposed PerCS-DINO and highlight the challenges of this new task. We expect PerCS to serve as a useful testbed for advancing research in cell-based applications.

200. 【2603.14320】Early Failure Detection and Intervention in Video Diffusion Models

Link: https://arxiv.org/abs/2603.14320

Authors: Kwon Byung-Ki, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low perceptual quality, low text-video alignment, low perceptual, rapidly advanced, perceptual quality

Comments: 29 pages, 24 figures, 9 tables

Abstract:Text-to-video (T2V) diffusion models have rapidly advanced, yet generations still occasionally fail in practice, such as low text-video alignment or low perceptual quality. Since diffusion sampling is non-deterministic, it is difficult to know during inference whether a generation will succeed or fail, incurring high computational cost due to trial-and-error regeneration. To address this, we propose an early failure detection and diagnostic intervention pipeline for latent T2V diffusion models. For detection, we design a Real-time Inspection (RI) module that converts latents into intermediate video previews, enabling the use of established text-video alignment scorers for inspection in the RGB space. The RI module completes the conversion and inspection process in just 39.2ms. This is highly efficient considering that CogVideoX-5B requires 4.3s per denoising step when generating a 480p, 49-frame video on an NVIDIA A100 GPU. Subsequently, we trigger a hierarchical and early-exit intervention pipeline only when failure is predicted. Experiments on CogVideoX-5B and Wan2.1-1.3B demonstrate consistency gains on VBench with up to 2.64 times less time overhead compared to post-hoc regeneration. Our method also generalizes to a higher-capacity setting, remaining effective on Wan2.1-14B with 720p resolution and 81-frame generation. Furthermore, our pipeline is plug-and-play and orthogonal to existing techniques, showing seamless compatibility with prompt refinement and sampling guidance methods. We also provide evidence that failure signals emerge early in the denoising process and are detectable within intermediate video previews using standard vision-language evaluators.
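To put the two timings quoted in the abstract side by side, the short calculation below expresses the 39.2 ms inspection cost as a fraction of one 4.3 s CogVideoX-5B denoising step. Only those two numbers come from the abstract; the variable names are illustrative, and how often the inspection actually runs during sampling is not stated there.

```python
inspect_ms = 39.2        # RI conversion + inspection time, from the abstract
step_ms = 4.3 * 1000     # one CogVideoX-5B denoising step, from the abstract

# Cost of a single inspection relative to a single denoising step.
ratio = inspect_ms / step_ms
print(f"{ratio:.2%}")    # prints 0.91%
```

In other words, one inspection costs under one percent of a single denoising step in that configuration.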

201. 【2603.14316】Direct Object-Level Reconstruction via Probabilistic Gaussian Splatting

Link: https://arxiv.org/abs/2603.14316

Authors: Shuai Guo, Ao Guo, Junchao Zhao, Qi Chen, Yuxiang Qi, Zechuan Li, Dong Chen, Tianjia Shao, Mingliang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cultural heritage digitization, play important roles, reconstruction play important, existing Gaussian Splatting-based, industrial manufacturing

Abstract:Object-level 3D reconstruction plays an important role across domains such as cultural heritage digitization, industrial manufacturing, and virtual reality. However, existing Gaussian Splatting-based approaches generally rely on full-scene reconstruction, in which substantial redundant background information is introduced, leading to increased computational and storage overhead. To address this limitation, we propose an efficient single-object 3D reconstruction method based on 2D Gaussian Splatting. By directly integrating foreground-background probability cues into Gaussian primitives and dynamically pruning low-probability Gaussians during training, the proposed method focuses on the object of interest and improves memory and computational efficiency. Our pipeline leverages probability masks generated by YOLO and SAM to supervise probabilistic Gaussian attributes, replacing binary masks with continuous probability values to mitigate boundary ambiguity. Additionally, we propose a dual-stage filtering strategy at the start of training to suppress background Gaussians. During training, rendered probability masks are in turn employed to refine supervision and enhance boundary consistency across views. Experiments conducted on the MIP-360, TT, and NVOS datasets demonstrate that our method exhibits strong self-correction capability in the presence of mask errors and achieves reconstruction quality comparable to standard 3DGS approaches, while requiring only approximately 1/10 of their number of Gaussians. These results validate the efficiency and robustness of our method for single-object reconstruction and highlight its potential for applications requiring both high fidelity and computational efficiency.
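The pruning step described above can be pictured with a very small sketch. It assumes each Gaussian carries a learned foreground probability and that pruning is a simple threshold test. The field name `fg_prob`, the threshold value, and the dictionary representation are all hypothetical, since the abstract does not specify them.

```python
def prune_gaussians(gaussians, min_prob=0.1):
    """Keep only Gaussians whose foreground probability reaches `min_prob`.

    Assumption (not from the paper): pruning is a plain threshold on a
    per-Gaussian probability attribute stored under the key "fg_prob".
    """
    return [g for g in gaussians if g["fg_prob"] >= min_prob]


# Toy scene: two likely-object Gaussians and two likely-background ones.
scene = [
    {"id": 0, "fg_prob": 0.97},
    {"id": 1, "fg_prob": 0.03},
    {"id": 2, "fg_prob": 0.55},
    {"id": 3, "fg_prob": 0.08},
]
kept = prune_gaussians(scene, min_prob=0.1)
print([g["id"] for g in kept])
```

Using a continuous probability rather than a binary mask means that Gaussians near an uncertain boundary (such as id 2 here) survive pruning and can still be corrected later in training.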

202. 【2603.14309】In-Field 3D Wheat Head Instance Segmentation From TLS Point Clouds Using Deep Learning Without Manual Labels

Link: https://arxiv.org/abs/2603.14309

Authors: Tomislav Medic, Liangliang Nan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing-related domains, point clouds remains, sensing-related domains, point clouds, remains a challenge

Comments: to be published in ISPRS Annals of Photogrammetry and Remote Sensing at XXV ISPRS Congress, Toronto, Canada, July 2026, 8 pages, 5 figures

Abstract:3D instance segmentation for laser scanning (LiDAR) point clouds remains a challenge in many remote sensing-related domains. Successful solutions typically rely on supervised deep learning and manual annotations, and consequently focus on objects that can be well delineated through visual inspection and manual labeling of point clouds. However, for tasks with more complex and cluttered scenes, such as in-field plant phenotyping in agriculture, such approaches are often infeasible. In this study, we tackle the task of in-field wheat head instance segmentation directly from terrestrial laser scanning (TLS) point clouds. To address the problem and circumvent the need for manual annotations, we propose a novel two-stage pipeline. To obtain the initial 3D instance proposals, the first stage uses 3D-to-2D multi-view projections, the Grounded SAM pipeline for zero-shot 2D object-centric segmentation, and multi-view label fusion. The second stage uses these initial proposals as noisy pseudo-labels to train a supervised 3D panoptic-style segmentation neural network. Our results demonstrate the feasibility of the proposed approach and show performance improvements relative to Wheat3DGS, a recent alternative solution for in-field wheat head instance segmentation without manual 3D annotations based on multi-view RGB images and 3D Gaussian Splatting, showcasing TLS as a competitive sensing alternative. Moreover, the results show that both stages of the proposed pipeline can deliver usable 3D instance segmentation without manual annotations, indicating promising, low-effort transferability to other comparable TLS-based point cloud segmentation tasks.

203. 【2603.14304】A Physically-Grounded Attack and Adaptive Defense Framework for Real-World Low-Light Image Enhancement

Link: https://arxiv.org/abs/2603.14304

Authors: Tongshun Zhang, Pingping Liu, Yuqing Lei, Zixuan Zhong, Qiuzhan Zhou, Zhiyuan Zha

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Limited illumination, Image Signal Processor, Limited, noise, LLIE

Abstract:Limited illumination often causes severe physical noise and detail degradation in images. Existing Low-Light Image Enhancement (LLIE) methods frequently treat the enhancement process as a blind black-box mapping, overlooking the physical noise transformation during imaging, leading to suboptimal performance. To address this, we propose a novel LLIE approach, conceptually formulated as a physics-based attack and display-adaptive defense paradigm. Specifically, on the attack side, we establish a physics-based Degradation Synthesis (PDS) pipeline. Unlike standard data augmentation, PDS explicitly models Image Signal Processor (ISP) inversion to the RAW domain, injects physically plausible photon and read noise, and re-projects the data to the sRGB domain. This generates high-fidelity training pairs with explicitly parameterized degradation vectors, effectively simulating realistic attacks on clean signals. On the defense side, we construct a dual-layer fortified system. A noise predictor estimates degradation parameters from the input sRGB image. These estimates guide a degradation-aware Mixture of Experts (DA-MoE), which dynamically routes features to experts specialized in handling specific noise intensities. Furthermore, we introduce an Adaptive Metric Defense (AMD) mechanism, dynamically calibrating the feature embedding space based on noise severity, ensuring robust representation learning under severe degradation. Extensive experiments demonstrate that our approach offers significant plug-and-play performance enhancement for existing benchmark LLIE methods, effectively suppressing real-world noise while preserving structural fidelity. The source code is available at this https URL.

204. 【2603.14301】4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding

Link: https://arxiv.org/abs/2603.14301

Authors: Mohamed Rayan Barhdadi, Samir Abdaljalil, Rasul Khanbayov, Erchin Serpedin, Hasan Kurban

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: representations decouple geometry, opaque per-point residuals, Synchronized Fields, methods encode dynamics, methods attach semantics

Comments: 34 pages, 3 figures, 7 tables. Includes supplementary material. Preprint

Abstract:Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field. Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation, yielding a single representation in which reconstruction, motion, and semantics are structurally coupled and enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines, within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439 respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, accounting for +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.
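The abstract reports tIoU scores without defining the metric. For readers unfamiliar with it, the sketch below shows the commonly used temporal intersection-over-union between a predicted and a ground-truth time interval. Treating each retrieval as a single half-open frame interval is an assumption here; the paper's exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two half-open frame intervals given as (start, end).

    Assumption (not from the paper): each query maps to one contiguous
    interval, which is the usual simplification for this metric.
    """
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Predicted frames 10..30 versus ground-truth frames 20..40:
# overlap is 10 frames, union is 30 frames.
print(temporal_iou((10, 30), (20, 40)))
```

A score of 1.0 means the retrieved moment matches the annotated one exactly, while disjoint intervals score 0.0.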

205. 【2603.14300】Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

Link: https://arxiv.org/abs/2603.14300

Authors: Mingqi Gao, Jinyu Yang, Jingnan Luo, Xiantong Zhen, Jungong Han, Giovanni Montana, Feng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently generated great, generated great popularity, computer vision due, Referring video object, RVOS

Abstract:Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting uses elaborately trimmed videos in which text-referred objects appear in every frame, which fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to predict only where objects are, with no need to show when they appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using Youtube Untrimmed videos for RVOS - YoURVOS, which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer) to tackle the challenges, which are characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially with the increase of target-absent frames, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.

206. 【2603.14297】RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360°Image Quality Assessment

Link: https://arxiv.org/abs/2603.14297

Authors: Yujia Wang, Yuyan Li, Jiuming Liu, Fang-Lue Zhang, Xinhu Zheng, Neil A. Dodgson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predict perceptual quality, aims to predict, pristine reference, predict perceptual, panoramic images

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Blind 360°image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360°content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view-then-rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL-ScanIQA, a reinforcement-learned framework for blind 360°IQA. RL-ScanIQA optimizes a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross-dataset robustness using distortion-space augmentation together with rank-consistent losses that preserve intra-image and inter-image quality orderings. Extensive experiments on three benchmarks show that RL-ScanIQA achieves superior in-dataset performance and cross-dataset generalization. Codes are available at this https URL.

207. 【2603.14294】Seeking Physics in Diffusion Noise

Link: https://arxiv.org/abs/2603.14294

Authors: Chujun Tang, Lei Zhong, Fangqiang Ding

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: models encode signals, encode signals predictive, diffusion models encode, video diffusion models, pretrained Diffusion Transformer

Comments: 32 pages, 8 figures, 10 tables

Click to view abstract

Abstract:Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
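
The progressive trajectory selection idea in the abstract above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: latents are scalars, and `step_fn`, `score_fn`, `checkpoints`, and `keep_fraction` are hypothetical stand-ins for the denoiser, the physics verifier, and the pruning schedule.

```python
def progressive_selection(candidates, total_steps, checkpoints, keep_fraction,
                          step_fn, score_fn):
    """Denoise candidates in parallel, pruning low scorers at checkpoints.

    Returns the best surviving candidate and the number of denoising
    steps actually spent across all candidates.
    """
    candidates = list(candidates)
    steps_used = 0
    for step in range(total_steps):
        # One denoising step for every candidate that is still alive.
        candidates = [step_fn(x, step) for x in candidates]
        steps_used += len(candidates)
        if step in checkpoints and len(candidates) > 1:
            # Score with the (hypothetical) verifier and keep the top fraction.
            ranked = sorted(candidates, key=score_fn, reverse=True)
            keep = max(1, int(len(ranked) * keep_fraction))
            candidates = ranked[:keep]
    return max(candidates, key=score_fn), steps_used


# Toy usage (all values are assumptions): each step halves the latent,
# and the stand-in verifier prefers latents closer to 0.
initial = [4.0, -3.0, 2.5, -2.0, 1.5, -1.0, 0.5, -0.25]
best, steps_used = progressive_selection(
    initial, total_steps=10, checkpoints={2, 5}, keep_fraction=0.5,
    step_fn=lambda x, step: 0.5 * x,
    score_fn=lambda x: -abs(x),
)
```

In this toy run the pruned schedule spends 44 candidate-steps, versus 80 for running all eight candidates to completion as Best-of-K would.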

208. 【2603.14290】RegFormer++: An Efficient Large-Scale 3D LiDAR Point Registration Network with Projection-Aware 2D Transformer

Link: https://arxiv.org/abs/2603.14290

Authors: Jiuming Liu, Guangming Wang, Zhe Liu, Chaokang Jiang, Haoang Li, Mengmeng Liu, Tianchen Deng, Marc Pollefeys, Michael Ying Yang, Hesheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable advances, LiDAR registration methods, indoor scenes, achieved remarkable, remarkable advances

Comments

Click to view abstract

Abstract:Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale LiDAR registration methods have rarely been explored. Challenges mainly arise from the huge point scale, complex point distribution, and numerous outliers within outdoor LiDAR scans. In addition, most existing registration works adopt a two-stage paradigm: they first find correspondences by extracting discriminative local descriptors and then leverage robust estimators (e.g., RANSAC) to filter outliers, which makes them highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose a novel end-to-end differential transformer network, termed RegFormer++, for large-scale point cloud alignment without requiring any further post-processing. Specifically, a hierarchical projection-aware 2D transformer with linear complexity is proposed to project raw LiDAR points onto a cylindrical surface and extract global point features, which can improve resilience to outliers due to long-range dependencies. Because we fill original 3D coordinates into 2D projected positions, our designed transformer can benefit from both high efficiency in 2D processing and accuracy from 3D geometric information. Furthermore, to effectively reduce wrong point matching, a Bijective Association Transformer (BAT) is designed, combining both cross attention and all-to-all point gathering. To improve training stability and robustness, a feature-transformed optimal transport module is also designed for regressing the final pose transformation. Extensive experiments on KITTI, NuScenes, and Argoverse datasets demonstrate that our model achieves state-of-the-art performance in terms of both accuracy and efficiency.
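
The step of projecting LiDAR points onto a cylindrical surface while keeping their 3D coordinates can be illustrated with a minimal sketch. The grid size, the field-of-view parameters, the angle-based row mapping, and the nearest-point-wins rule are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def cylindrical_project(points, height, width, fov_up_deg, fov_down_deg):
    """Map 3D points (x, y, z) to a height x width grid indexed by
    elevation (rows) and azimuth (columns). Each occupied cell stores the
    original 3D coordinates; empty cells hold None.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    grid = [[None] * width for _ in range(height)]
    best_range = [[math.inf] * width for _ in range(height)]
    for x, y, z in points:
        horizontal = math.hypot(x, y)
        if horizontal == 0.0:
            continue  # azimuth is undefined directly above/below the sensor
        azimuth = math.atan2(y, x)             # in [-pi, pi]
        elevation = math.atan2(z, horizontal)  # in (-pi/2, pi/2)
        col = math.floor((azimuth + math.pi) / (2.0 * math.pi) * width) % width
        row = math.floor((fov_up - elevation) / fov * height)
        if not 0 <= row < height:
            continue  # outside the vertical field of view
        distance = math.sqrt(x * x + y * y + z * z)
        if distance < best_range[row][col]:    # assumption: nearest point wins
            best_range[row][col] = distance
            grid[row][col] = (x, y, z)
    return grid


# Toy scan: two points along +x (same cell) and one far above the field of view.
scan = [(1.0, 0.0, 0.0), (1.0, 0.0, 5.0), (2.0, 0.0, 0.0)]
grid = cylindrical_project(scan, height=4, width=8, fov_up_deg=10.0, fov_down_deg=-10.0)
```

Because each cell keeps the raw `(x, y, z)` tuple, a 2D network operating on the grid still has access to the 3D geometry, which is the property the abstract highlights.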

209. 【2603.14282】Multi-Period Texture Contrast Enhancement for Low-Contrast Wafer Defect Detection and Segmentation

Link: https://arxiv.org/abs/2603.14282

Authors: Zihan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Wafer defect segmentation, semiconductor yield optimization, Wafer defect, segmentation is pivotal, pivotal for semiconductor

Comments

Click to view abstract

Abstract:Wafer defect segmentation is pivotal for semiconductor yield optimization yet remains challenged by the intrinsic conflict between microscale anomalies and highly periodic, overwhelming background textures. Existing deep learning paradigms often falter due to feature dilution during downsampling and the lack of explicit mechanisms to disentangle low-contrast defects from process-induced noise. To transcend these limitations, we propose TexWDS, a texture-aware framework that harmonizes multi-scale feature retention with frequency-domain perturbation modeling. Our methodology incorporates three strategic innovations: (1) A Multi-scale Receptive Field Reweighting strategy is introduced to mitigate aliasing effects and preserve high-frequency details of micro-defects often lost in standard pyramidal architectures. (2) The Multi-scale Unified Semantic Enhancer (MUSE) integrates local appearance with global context encoding, effectively enhancing feature discriminability in low-visibility regions. (3) Crucially, we design a plug-and-play Multi-Periodic Texture Contrast Enhancement (MPTCE) module. By modeling texture disruptions in the frequency domain, MPTCE explicitly decouples non-periodic anomalies from structured backgrounds, boosting contrast for camouflaged defects. Extensive experiments on real-world industrial datasets demonstrate that TexWDS achieves a new state-of-the-art, surpassing the baseline by 8.3% in mAP50-95 and 7.7% in recall, while reducing the false positive rate by approximately 8.6%. These results underscore the framework's robustness in handling complex periodic patterns and its suitability for high-precision manufacturing inspection.

210. 【2603.14281】DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel Images

Link: https://arxiv.org/abs/2603.14281

Authors: Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, varying staining protocols, Training and evaluation, sensor types, heterogeneous channel configurations

Comments

Click to view abstract

Abstract:Training and evaluation in multi-channel imaging (MCI) remains challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing using Decoupled Self-Attention (DSA), which decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.

211. 【2603.14276】All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation

Link: https://arxiv.org/abs/2603.14276

Authors: Xudong Wang, Gan Li, Zhiyu Liu, Yao Wang, Lianqing Liu, Zhi Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: flexible long-term deployment, severely limits flexible, limits flexible long-term, long-term deployment, catastrophic forgetting

Comments: ICLR 2026

Click to view abstract

Abstract:Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
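
To make the contrast with a two-dimensional LoRA update concrete, the following sketch rebuilds a third-order tensor from a Tucker decomposition and slices out one weight update per scenario. The role assignment (shared core and factors `U`, `V`; one row of `W` per scenario as its "expert") is an assumption for illustration, not the paper's exact parameterization.

```python
def tucker_reconstruct(core, U, V, W):
    """Rebuild T[i][j][k] = sum_abc core[a][b][c] * U[i][a] * V[j][b] * W[k][c]."""
    ranks = range(len(core)), range(len(core[0])), range(len(core[0][0]))
    return [[[sum(core[a][b][c] * U[i][a] * V[j][b] * W[k][c]
                  for a in ranks[0] for b in ranks[1] for c in ranks[2])
              for k in range(len(W))]
             for j in range(len(V))]
            for i in range(len(U))]


def scenario_update(tensor, k):
    """Slice out the 2D weight update for scenario k."""
    return [[cell[k] for cell in row] for row in tensor]


core = [[[2.0]]]            # shared core, multilinear rank (1, 1, 1)
U = [[1.0], [2.0]]          # shared output-side factor (2 x 1)
V = [[3.0], [4.0]]          # shared input-side factor (2 x 1)
W = [[1.0], [10.0]]         # one row per scenario: hypothetical expert vectors
delta = tucker_reconstruct(core, U, V, W)
```

Here `scenario_update(delta, 0)` and `scenario_update(delta, 1)` share the same `core`, `U`, and `V` and differ only through their row of `W`.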

212. 【2603.14271】Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs

Link: https://arxiv.org/abs/2603.14271

Authors: Karma Phuntsho, Abdullah, Kyungmi Lee, Ickjai Lee, Euijoon Ahn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong transferability, utility depends critically, Foundation models, clinical utility depends, supervision regimes

Comments

Click to view abstract

Abstract:Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.

213. 【2603.14267】DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

Link: https://arxiv.org/abs/2603.14267

Authors: Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)

Keywords: assistive speech technology, multimedia creation, applications in filmmaking, broad applications, pre-trained TTS model

Comments: Accepted at CVPR 2026 Findings

Click to view abstract

Abstract:Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.

214. 【2603.14255】ITKIT: Feasible CT Image Analysis based on SimpleITK and MMEngine

Link: https://arxiv.org/abs/2603.14255

Authors: Yiqin Zhang, Meiling Chen

Categories: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

Keywords: diagnosis and treatment, facto standard, clinical diagnosis, data have formed, ITKIT

Comments

Click to view abstract

Abstract:CT images are widely used in clinical diagnosis and treatment, and their data follow a de facto standard, DICOM. The format is clear and easy to use, and can be efficiently utilized by data-driven analysis methods such as deep learning. In the past decade, many program frameworks for medical image analysis have emerged in the open-source community. ITKIT analyzes the characteristics of these frameworks and aims to provide a better choice in terms of ease of use and configurability. ITKIT offers a complete pipeline from DICOM to 3D segmentation inference. Its basic workflow includes only the essential steps, enabling users with relatively weak computing capabilities to quickly get started using the CLI according to the documentation. For advanced users, the OneDL-MMEngine framework provides a flexible model configuration and deployment entry. This paper reports 12 typical experiments to verify that ITKIT can meet the needs of most basic scenarios.

215. 【2603.14254】ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization

Link: https://arxiv.org/abs/2603.14254

Authors: Ronghao Zhang, Shuaicheng Niu, Qi Deng, Yanjie Dong, Jian Chen, Runhao Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: unlabeled test data, limiting practical deployment, numerous edge devices, existing methods rely, Test-time adaptation

Comments: 14 pages, 13 figures

Click to view abstract

Abstract:Test-time adaptation (TTA) aims to improve model robustness under distribution shifts by adapting to unlabeled test data, but most existing methods rely on backpropagation (BP), which is computationally costly and incompatible with non-differentiable models such as quantized models, limiting practical deployment on numerous edge devices. Recent BP-free approaches alleviate overhead but remain either architecture-specific or limited in optimization capacity to handle high-dimensional models. We propose ZOTTA, a fully BP-free TTA framework that performs efficient adaptation using only forward passes via Zeroth-Order Optimization (ZOO). While ZOO is theoretically appealing, naive application leads to slow convergence under high-dimensional parameter spaces and unstable optimization due to the lack of labels. ZOTTA overcomes these challenges through 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features, updating only domain-sensitive layers to reduce the optimization dimensionality and accelerate convergence; 2) Spatial Feature Aggregation Alignment, which stabilizes ZOO by aligning globally aggregated spatial features between source and target to reduce gradient variance. Together, these components enable architecture-agnostic and stable BP-free adaptation. Extensive experiments on ImageNet-C/R/Sketch/A show that ZOTTA outperforms or matches BP-based methods, e.g., it reduces memory usage by 84% and improves accuracy by 3.9% over SAR on ImageNet-C.
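
The core idea of zeroth-order optimization, updating parameters from forward evaluations only, can be sketched with a generic two-point estimator. This is a textbook-style illustration, not ZOTTA itself: the layer selection and feature alignment components are omitted, and `toy_loss`, `mu`, `num_samples`, and `lr` are assumed values.

```python
import random


def zo_gradient(loss_fn, params, rng, mu=1e-3, num_samples=20):
    """Two-point zeroth-order gradient estimate using only loss evaluations."""
    grad = [0.0] * len(params)
    for _ in range(num_samples):
        # Random Gaussian probing direction.
        direction = [rng.gauss(0.0, 1.0) for _ in params]
        plus = [p + mu * d for p, d in zip(params, direction)]
        minus = [p - mu * d for p, d in zip(params, direction)]
        # Finite-difference slope along the direction (two forward passes).
        slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
        for i, d in enumerate(direction):
            grad[i] += slope * d / num_samples
    return grad


def zo_minimize(loss_fn, params, lr=0.1, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def toy_loss(params):
    """Hypothetical stand-in for an unsupervised adaptation objective."""
    target = [1.0, -2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))


solution = zo_minimize(toy_loss, [0.0, 0.0])
```

The variance of this estimator grows with the number of parameters being perturbed, which is consistent with the abstract's motivation for updating only a subset of layers.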

216. 【2603.14252】MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos

Link: https://arxiv.org/abs/2603.14252

Authors: Sagnik Majumder, Anish Nethi, Ziad Al-Halah, Kristen Grauman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: introduce the task, activity is performed, performed correctly, correctly while observing, early mistake detection

Comments

Click to view abstract

Abstract:We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: this https URL.

217. 【2603.14249】OAHuman: Occlusion-Aware 3D Human Reconstruction from Monocular Images

Link: https://arxiv.org/abs/2603.14249

Authors: Yuanwang Yang, Hongliang Liu, Muxin Zhang, Nan Ma, Jingyu Yang, Yu-Kun Lai, Kun Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world scenarios remains, scenarios remains highly, remains highly challenging, highly challenging due, surrounding objects

Comments

Click to view abstract

Abstract:Monocular 3D human reconstruction in real-world scenarios remains highly challenging due to frequent occlusions from surrounding objects, people, or image truncation. Such occlusions lead to missing geometry and unreliable appearance cues, severely degrading the completeness and realism of reconstructed human models. Although recent neural implicit methods achieve impressive results on clean inputs, they struggle under occlusion due to entangled modeling of shape and texture. In this paper, we propose OAHuman, an occlusion-aware framework that explicitly decouples geometry reconstruction and texture synthesis for robust 3D human modeling from a single RGB image. The core innovation lies in the decoupling-perception paradigm, which addresses the fundamental issue of geometry-texture cross-contamination in occluded regions. Our framework ensures that geometry reconstruction is perceptually reinforced even in occluded areas, isolating it from texture interference. In parallel, texture synthesis is learned exclusively from visible regions, preventing texture errors from being transferred to the occluded areas. This decoupling approach enables OAHuman to achieve robust and high-fidelity reconstruction under occlusion, which has been a long-standing challenge in the field. Extensive experiments on occlusion-rich benchmarks demonstrate that OAHuman achieves superior performance in terms of structural completeness, surface detail, and texture realism, significantly improving monocular 3D human reconstruction under occlusion conditions.

218. 【2603.14243】BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2603.14243

Authors: Haoxuan Xu, Guanglin Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visible-Infrared Person Re-Identification, Visible-Infrared Person, Person Re-Identification, substantial modality gap, challenging retrieval task

Comments

Click to view abstract

Abstract:Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query-aware scoring to enhance cross-modality correspondence. To the best of our knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.

219. 【2603.14241】CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control

Link: https://arxiv.org/abs/2603.14241

Authors: Zhiyi Kuang, Chengan He, Egor Zakharov, Yuxuan Xue, Shunsuke Saito, Olivier Maury, Timur Bagautdinov, Youyi Zheng, Giljoo Nam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unified video diffusion, single input image, video diffusion model, jointly performs, input image

Comments: 11 pages, 6 figures

Click to view abstract

Abstract:We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.

220. 【2603.14240】FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains

Link: https://arxiv.org/abs/2603.14240

Authors: Vaibhav Rathore, Divyam Gupta, Moloud Abdar, Subhasis Chaudhuri, Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Domain-Generalized Generalized Category, Generalized Category Discovery, Fine-Grained Domain-Generalized Generalized, Generalized Category, bringing open-world recognition

Comments: Under Review

Click to view abstract

Abstract:We introduce the first unified framework for Fine-Grained Domain-Generalized Generalized Category Discovery (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first FG-DG-GCD benchmarks by creating identity-preserving painting and sketch domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this, we propose FoCUS, a single-stage framework that combines Domain-Consistent Parts Discovery (DCPD) for geometry-stable part reasoning with Uncertainty-Aware Feature Augmentation (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by 3.28%, 9.68%, and 2.07%, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly 3x higher computational efficiency than the current state of the art. Code and datasets will be released upon acceptance.

221. 【2603.14232】S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction

Link: https://arxiv.org/abs/2603.14232

Authors: Renhe Zhang, Yuyang Tan, Jingyu Gong, Zhizhong Zhang, Lizhuang Ma, Yuan Xie, Xin Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing offline feed-forward, long image streams, repeatedly perform global, perform global computation, offline feed-forward methods

Comments: 10 pages, 7 figures

Click to view abstract

Abstract:Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.

222. 【2603.14228】Not All Directions Matter: Toward Structured and Task-Aware Low-Rank Adaptation

Link: https://arxiv.org/abs/2603.14228

Authors: Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, Hao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: parameter-efficient fine-tuning, Low-Rank Adaptation, cornerstone of parameter-efficient, structural incoherence, semantic drift

Comments

Click to view abstract

Abstract:Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, by treating all update directions with equal importance, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that addresses both limitations through a principled, dual-component design: (1) an Information Bottleneck-guided filter that prunes task-irrelevant directions to mitigate semantic drift, and (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Extensive experiments across large language models, vision-language models, and vision models (including LLaMA, LLaVA, and ViT) demonstrate that StructLoRA consistently establishes a new state-of-the-art, outperforming not only vanilla LoRA but also advanced dynamic rank allocation and sparsity-based methods. Notably, the benefits are particularly pronounced in challenging low-rank and low-data regimes. Crucially, since our proposed modules operate only during training, StructLoRA enhances performance with zero additional inference cost, advancing the focus of PEFT from mere parameter compression to a more holistic optimization of information quality and structural integrity.

223. 【2603.14220】FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection

Link: https://arxiv.org/abs/2603.14220

Authors: Jie Li, Yingying Feng, Chi Xie, Jie Hu, Lei Tan, Jiayi Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical detection challenges, poses critical detection, models poses critical, diffusion models poses, diffusion models

Comments: AAAI'26

Click to view abstract

Abstract:The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, simply by adding noise, they still retain visual similarity to the original images, highlighting the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.
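
The key operation described above, adding Gaussian noise to real images and labeling the noisy copies as synthetic, can be sketched as a data-preparation step. This is a minimal illustration under assumptions: images are flat lists of pixel values in [0, 1], and `sigma`, the clipping, and the label encoding (0 = real, 1 = synthetic) are hypothetical choices not given in the abstract.

```python
import random


def build_find_training_set(real_images, sigma=0.1, seed=0):
    """Pair every real image with a noise-disturbed copy labeled as synthetic.

    Returns a list of (image, label) tuples, with label 0 for real images
    and label 1 for their Gaussian-noised versions.
    """
    rng = random.Random(seed)
    samples = []
    for image in real_images:
        samples.append((list(image), 0))  # original image, labeled real
        noisy = [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in image]
        samples.append((noisy, 1))        # noisy copy, labeled synthetic
    return samples


# Toy usage: two 4-pixel "images" (values are placeholders, not real data).
real_images = [[0.1, 0.5, 0.9, 0.3], [0.0, 1.0, 0.2, 0.8]]
training_set = build_find_training_set(real_images)
```

Any binary classifier can then be trained on `training_set`; no diffusion model or reconstruction pass is involved at this stage, which matches the efficiency argument in the abstract.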

224. 【2603.14219】Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Link: https://arxiv.org/abs/2603.14219

Authors: Chongxin Li, Hanzhang Wang, Lian Duan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: latent structural responsiveness, models' latent structural, Safety prompts constitute, Safety Subnetwork Hypothesis, constitute an interpretable

Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL)

Click to view abstract

Abstract:Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.

225. 【2603.14214】UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

Link: https://arxiv.org/abs/2603.14214

Authors: Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang, Jinyuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: visually consistent representation, integrate complementary information, downstream vision tasks, consistent representation, benefiting both human

Comments: 11 pages, 8 figures, published to CVPR 2026

Click to view abstract

Abstract:Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at this https URL.

226. 【2603.14209】ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

Link: https://arxiv.org/abs/2603.14209

Authors: Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: seamlessly integrating visual, integrating visual elements, seamlessly integrating, effective medium, pictorial chart

Comments: Project page: [this https URL](https://chartist-ai.github.io/)

Click to view abstract

Abstract:A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: this https URL.

227. 【2603.14207】DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution

Link: https://arxiv.org/abs/2603.14207

Authors: Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun, Yanning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Optical Character Recognition, Scene Text Image, restore high-resolution details, machine recognition, Scene Text

Comments:

Click to view abstract

Abstract:Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.

228. 【2603.14203】Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual Segmentation

Link: https://arxiv.org/abs/2603.14203

Authors: Kai Peng, Yunzhe Shen, Miao Zhang, Leiye Liu, Yidong Han, Wei Ji, Jingjing Li, Yongri Piao, Huchuan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segment sounding objects, dynamic visual scenes, Audio-Visual Segmentation, ability to capture, capture and segment

Comments:

Click to view abstract

Abstract:The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. The code and model are available at this https URL.

229. 【2603.14189】Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions

Link: https://arxiv.org/abs/2603.14189

Authors: Zhiyang Lu, Wen Jiang, Tianren Wu, Zhichao Wang, Changwang Zhang, Siqi Shen, Ming Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: emerging biometric technology, emerging biometric, biometric technology, technology that enables, enables non-intrusive

Comments: Accepted by AAAI 2026

Click to view abstract

Abstract:Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.

230. 【2603.14188】Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis

Link: https://arxiv.org/abs/2603.14188

Authors: Zhiwei Wang, Yuxing Li, Meilu Zhu, Defeng He, Edmund Y. Lam

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lack clear structural, appearance cues, lack clear, clear structural, structural or appearance

Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI), 2026

Click to view abstract

Abstract:Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at this https URL.

231. 【2603.14187】Deep Learning From Routine Histology Improves Risk Stratification for Biochemical Recurrence in Prostate Cancer

Link: https://arxiv.org/abs/2603.14187

Authors: Clément Grisi, Khrystyna Faryna, Nefise Uysal, Vittorio Agosti, Enrico Munari, Solène-Florence Kammerer-Jacquet, Paulo Guilherme de Oliveira Salles, Yuri Tolkach, Reinhard Büttner, Sofiya Semko, Maksym Pikul, Axel Heidenreich, Jeroen van der Laak, Geert Litjens

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: guiding adjuvant treatment, Accurate prediction, prediction of biochemical, critical for guiding, guiding adjuvant

Comments: Preprint

Click to view abstract

Abstract:Accurate prediction of biochemical recurrence (BCR) after radical prostatectomy is critical for guiding adjuvant treatment and surveillance decisions in prostate cancer. However, existing clinicopathological risk models reduce complex morphology to relatively coarse descriptors, leaving substantial prognostic information embedded in routine histopathology underexplored. We present a deep learning-based biomarker that predicts continuous, patient-specific risk of BCR directly from HE-stained whole-slide prostatectomy specimens. Trained end-to-end on time-to-event outcomes and evaluated across four independent international cohorts, our model demonstrates robust generalization across institutions and patient populations. When integrated with the CAPRA-S clinical risk score, the deep learning risk score consistently improved discrimination for BCR, increasing concordance indices from 0.725-0.772 to 0.749-0.788 across cohorts. To support clinical interpretability, outcome-grounded analyses revealed subtle histomorphological patterns associated with recurrence risk that are not captured by conventional clinicopathological risk scores. This multicohort study demonstrates that deep learning applied to routine prostate histopathology can deliver reproducible and clinically generalizable biomarkers that augment postoperative risk stratification, with potential to support personalized management of prostate cancer in real-world clinical settings.

232. 【2603.14186】Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

Link: https://arxiv.org/abs/2603.14186

Authors: Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce high-quality images, sequential ODE, ODE or denoising, models produce high-quality, inference remains expensive

Comments:

Click to view abstract

Abstract:State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
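The abstract names MMHM but does not define it. Going only by the name, one plausible reading is: min-max normalize each metric across the configurations in a sweep, flip metrics where lower is better (such as FID), then take the harmonic mean per configuration. The sketch below implements that guess; the function name, the `lower_is_better` argument, the `eps` smoothing term, and the handling of constant metrics are assumptions of this sketch, and the paper's actual definition may differ.

```python
def minmax_harmonic_mean(rows, lower_is_better=("fid",), eps=1e-8):
    """Score each configuration in a sweep by a min-max harmonic mean.

    rows: list of dicts mapping metric name -> value, one dict per
          configuration (e.g. per CFG scale or per step count).
    lower_is_better: metric names whose normalized score is flipped.
    Returns one score per configuration, roughly in [0, 1].
    """
    metrics = list(rows[0].keys())
    normalized = [[] for _ in rows]
    for name in metrics:
        values = [row[name] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for i, value in enumerate(values):
            # A metric that is constant across the sweep carries no signal.
            score = 0.5 if span == 0 else (value - lo) / span
            if name in lower_is_better:
                score = 1.0 - score
            normalized[i].append(score)
    # The harmonic mean is dominated by the weakest metric, so a configuration
    # that is worst on any single metric scores close to zero.
    return [len(s) / sum(1.0 / (x + eps) for x in s) for s in normalized]
```

Under this reading, a guidance setting that wins on FID but is worst on CLIP Score or Pick Score is pushed to the bottom of the ranking, which is the kind of trade-off the abstract says FID-only selection hides.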

233. 【2603.14184】Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.14184

Authors: Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, Multimodal large, large language models, extended reasoning modes, large language

Comments:

Click to view abstract

Abstract:Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

234. 【2603.14176】BluRef: Unsupervised Image Deblurring with Dense-Matching References

Link: https://arxiv.org/abs/2603.14176

Authors: Bang-Dang Pham, Anh Tran, Cuong Pham, Minh Hoai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training data collection, paper introduces, utilizes a simple, enhancing the applicability, applicability and effectiveness

Comments: Accepted to CVPR 2026. Project page: [this https URL](https://qualcomm-ai-research.github.io/BluRef/)

Click to view abstract

Abstract:This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.

235. 【2603.14175】Balancing Multimodal Domain Generalization via Gradient Modulation and Projection

Link: https://arxiv.org/abs/2603.14175

Authors: Hongzhao Li, Guohao Shen, Shupan Li, Mingliang Xu, Muhammad Haris Khan

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhance model generalization, leverages the complementary, enhance model, MMDG, GMP

Comments: Accepted to AAAI 2026 (Oral)

Click to view abstract

Abstract:Multimodal Domain Generalization (MMDG) leverages the complementary strengths of multiple modalities to enhance model generalization on unseen domains. A central challenge in multimodal learning is optimization imbalance, where modalities converge at different speeds during training. This imbalance leads to unequal gradient contributions, allowing some modalities to dominate the learning process while others lag behind. Existing balancing strategies typically regulate each modality's gradient contribution based on its classification performance on the source domain to alleviate this issue. However, relying solely on source-domain accuracy neglects a key insight in MMDG: modalities that excel on the source domain may generalize poorly to unseen domains, limiting cross-domain gains. To overcome this limitation, we propose Gradient Modulation Projection (GMP), a unified strategy that promotes balanced optimization in MMDG. GMP first decouples gradients associated with classification and domain-invariance objectives. It then modulates each modality's gradient based on semantic and domain confidence. Moreover, GMP dynamically adjusts gradient projections by tracking the relative strength of each task, mitigating conflicts between classification and domain-invariant learning within modality-specific encoders. Extensive experiments demonstrate that GMP achieves state-of-the-art performance and integrates flexibly with diverse MMDG methods, significantly improving generalization across multiple benchmarks.

236. 【2603.14153】Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories

Link: https://arxiv.org/abs/2603.14153

Authors: Junyao Hu, Zhongwei Cheng, Waikeung Wong, Xingxing Zou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced single-garment visualization, real-world fashion centers, current VTON systems, Virtual try-on, single-garment visualization

Comments: CVPR 2026; Project Page: [this https URL](https://artmesciencelab.github.io/Garments2Look)

Click to view abstract

Abstract:Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, which remain beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (4.48 on average), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline. It involves heuristically constructing outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.

237. 【2603.14152】SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation

Link: https://arxiv.org/abs/2603.14152

Authors: Anbang Wang, Yuzhuo Ao, Shangzhe Wu, Chi-Keung Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: space remains underexplored, achieved remarkable fidelity, precise structural articulations, fidelity and speed, critical limitation

Comments: 26 pages, 9 figures

Click to view abstract

Abstract:Native 3D generative models have achieved remarkable fidelity and speed, yet they suffer from a critical limitation: inability to prescribe precise structural articulations, where precise structural control within the native 3D space remains underexplored. This paper proposes SK-Adapter, a simple and yet highly efficient and effective framework that unlocks precise skeletal manipulation for native 3D generation. Moving beyond text or image prompts, which can be ambiguous for precise structure, we treat the 3D skeleton as a first-class control signal. SK-Adapter is a lightweight structural adapter network that encodes joint coordinates and topology into learnable tokens, which are injected into the frozen 3D generation backbone via cross-attention. This smart design allows the model to not only effectively "attend" to specific 3D structural constraints but also preserve its original generative priors. To bridge the data gap, we contribute Objaverse-TMS dataset, a large-scale dataset of 24k text-mesh-skeleton pairs. Extensive experiments confirm that our method achieves robust structural control while preserving the geometry and texture quality of the foundation model, significantly outperforming existing baselines. Furthermore, we extend this capability to local 3D editing, enabling the region specific editing of existing assets with skeletal guidance, which is unattainable by previous methods. Project Page: this https URL

238. 【2603.14151】Seeing Through the PRISM: Compound Controllable Restoration of Scientific Images

Link: https://arxiv.org/abs/2603.14151

Authors: Rupa Kurinchi-Vendhan, Pratyusha Sharma, Antonio Torralba, Sara Beery

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: environmental imagery, imagery often suffer, noise related, restoration, PRISM

Comments:

Click to view abstract

Abstract:Scientific and environmental imagery often suffer from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard "black-box" restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.

239. 【2603.14150】CIPHER: Culvert Inspection through Pairwise Frame Selection and High-Efficiency Reconstruction

Link: https://arxiv.org/abs/2603.14150

Authors: Seoyoung Lee, Zhangyang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flood management operations, Automated culvert inspection, Automated culvert, management operations, increase the safety

Comments: Accepted by ICCV 2026 End-to-End 3D Learning

Click to view abstract

Abstract:Automated culvert inspection systems can help increase the safety and efficiency of flood management operations. As a key step toward such a system, we present an efficient RGB-based 3D reconstruction pipeline for culvert-like structures in visually repetitive environments. Our approach first uses a plug-and-play module to select informative frame pairs that maximize viewpoint diversity while ensuring valid correspondence matching, followed by a reconstruction model that simultaneously estimates RGB appearance, geometry, and semantics in real time. Experiments demonstrate that our method effectively generates accurate 3D reconstructions and depth maps, enhancing culvert inspection efficiency with minimal human intervention.

240. 【2603.14145】MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Link: https://arxiv.org/abs/2603.14145

Authors: Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, shown strong performance, Multimodal Large

Comments: Project Page: [this https URL](https://huggingface.co/datasets/nvidia/MMOU)

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.

241. 【2603.14132】DualSwinFusionSeg: Multimodal Martian Landslide Segmentation via Dual Swin Transformer with Multi-Scale Fusion and UNet++

Link: https://arxiv.org/abs/2603.14132

Authors: Shahriar Kabir, Abdullah Muhammed Amimul Ehsan, Istiak Ahmmed Rifti, Md Kaykobad Reza

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Valles Marineris,is important, future robotic exploration, Valles Marineris,is, tectonically active regions, Marineris,is important

Comments: 10 pages, 2 Figures, 12 Tables. Code is available at: [this https URL](https://github.com/amimulamim/Mars-LS-Segmentation)

Click to view abstract

Abstract:Automated segmentation of Martian landslides, particularly in tectonically active regions such as Valles Marineris, is important for planetary geology, hazard assessment, and future robotic exploration. However, detecting landslides from planetary imagery is challenging due to the heterogeneous nature of available sensing modalities and the limited number of labeled samples. Each observation combines RGB imagery with geophysical measurements such as digital elevation models, slope maps, thermal inertia, and contextual grayscale imagery, which differ significantly in resolution and statistical properties. To address these challenges, we propose DualSwinFusionSeg, a multimodal segmentation architecture that separates modality-specific feature extraction and performs multi-scale cross-modal fusion. The model employs two parallel Swin Transformer V2 encoders to independently process RGB and auxiliary geophysical inputs, producing hierarchical feature representations. Corresponding features from the two streams are fused at multiple scales and decoded using a UNet++ decoder with dense nested skip connections to preserve fine boundary details. Extensive ablation studies evaluate modality contributions, loss functions, decoder architectures, and fusion strategies. Experiments on the MMLSv2 dataset from the PBVS 2026 Mars-LS Challenge show that modality-specific encoders and simple concatenation-based fusion improve segmentation accuracy under limited training data. The final model achieves 0.867 mIoU and 0.905 F1 on the development benchmark and 0.783 mIoU on the held-out test set, demonstrating strong performance for multimodal planetary surface segmentation.

242. 【2603.14128】Diffusion Reinforcement Learning via Centered Reward Distillation

Link: https://arxiv.org/abs/2603.14128

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: flow matching pretraining, fine-grained prompt fidelity, practically important behaviors, matching pretraining objectives, generative performance

Comments

Click to view abstract

Abstract: Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
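
The cancellation under within-prompt centering can be checked numerically. The sketch below assumes the standard closed form of KL-regularized reward maximization, in which the reward equals beta times the log-density ratio plus beta times a prompt-dependent log normalizer. The function names, sample values, and constants are hypothetical and are not taken from the CRD code.

```python
def center(values):
    """Subtract the group mean from the per-sample values of one prompt."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def implied_rewards(log_ratios, beta, log_z):
    """r = beta * log(p*/p_ref) + beta * log Z(c) for the samples of one prompt."""
    return [beta * lr + beta * log_z for lr in log_ratios]

# Hypothetical log-density ratios log(p*/p_ref) for four samples of one prompt.
log_ratios = [0.3, -0.1, 0.7, -0.5]
beta = 0.5

# Two very different values of the (normally intractable) log normalizer.
rewards_a = implied_rewards(log_ratios, beta, log_z=2.0)
rewards_b = implied_rewards(log_ratios, beta, log_z=-7.5)

# After centering within the prompt, the normalizer no longer matters:
# both lists equal beta times the centered log-ratios.
centered_a = center(rewards_a)
centered_b = center(rewards_b)
centered_ratio = [beta * v for v in center(log_ratios)]
```

Because the normalizer is constant across samples that share a prompt, subtracting the per-prompt mean removes it without ever estimating it, which is what makes a reward-matching objective on centered quantities well posed.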

243. 【2603.14127】Implementation and discussion of the Pith Estimation on Rough Log End Images using Local Fourier Spectrum Analysis method

Link: https://arxiv.org/abs/2603.14127

Authors: Henry Marichal, Diego Passarella, Gregory Randall

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fourier Spectrum Analysis, Rough Log End, Local Fourier Spectrum, Log End images, Pith Estimation

Comments

Click to view abstract

Abstract:In this article, we analyze and propose a Python implementation of the method "Pith Estimation on Rough Log End images using Local Fourier Spectrum Analysis", by Rudolf Schraml and Andreas Uhl. The algorithm is tested over two datasets.

244. 【2603.14125】Low-Field Magnetic Resonance Image Enhancement using Undersampled k-Space

Link: https://arxiv.org/abs/2603.14125

Authors: Daniel Tweneboah Anyimadu (1), Mohammed Abdalla (2), Mohammed M. Abdelsamea (1), Ahmed Karam Eldaly (1 and 3) ((1) Department of Computer Science, University of Exeter, United Kingdom, (2) Neurology Department, Royal Devon and Exeter Hospital, Exeter, United Kingdom, (3) UCL Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, Low-field magnetic resonance, offers a cost-effective, resource-limited settings, resonance imaging

Comments: 13 pages, 8 figures

Click to view abstract

Abstract: Low-field magnetic resonance imaging (MRI) offers a cost-effective alternative for medical imaging in resource-limited settings. However, its widespread adoption is hindered by two key challenges: prolonged scan times and reduced image quality. Accelerated acquisition can be achieved using k-space undersampling, while image enhancement traditionally relies on spatial-domain postprocessing. In this work, we propose a novel deep learning framework based on a U-Net variant that operates directly in k-space to super-resolve low-field MR images from undersampled data while quantifying the impact of reduced k-space sampling. Unlike conventional approaches that treat image super-resolution as a postprocessing step following image reconstruction from undersampled k-space, our unified model integrates both processes, leveraging k-space information to achieve superior image fidelity. Extensive experiments on synthetic and real low-field brain MRI datasets demonstrate that k-space-driven image super-resolution outperforms conventional spatial-domain counterparts. Furthermore, our results show that undersampled k-space reconstructions achieve comparable quality to full k-space acquisitions, enabling substantial scan-time acceleration without compromising diagnostic utility.

245. 【2603.14120】Low-Field Magnetic Resonance Image Quality Enhancement using Undersampled k-Space and Out-of-Distribution Generalisation

Link: https://arxiv.org/abs/2603.14120

Authors: Daniel Tweneboah Anyimadu (1), Mohammed M. Abdelsamea (1), Ahmed Karam Eldaly (1 and 2) ((1) Department of Computer Science, University of Exeter, Exeter, United Kingdom, (2) UCL Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers affordable access, magnetic resonance imaging, Low-field magnetic resonance, offers affordable, magnetic resonance

Comments: 5 pages, 5 figures

Click to view abstract

Abstract: Low-field magnetic resonance imaging (MRI) offers affordable access to diagnostic imaging but faces challenges such as prolonged acquisition times and reduced image quality. Although accelerated imaging via k-space undersampling helps reduce scan time, image quality enhancement methods often rely on spatial-domain postprocessing. Deep learning has achieved state-of-the-art results in both domains. However, most models are trained and evaluated using in-distribution (InD) data, creating a significant gap in understanding model performance when tested using out-of-distribution (OOD) data. To address these issues, we propose a novel framework that reconstructs high-field-like MR images directly from undersampled low-field MRI k-space, quantifies the impact of reduced sampling, and evaluates the generalisability of the model on OOD data. Our approach utilises a k-space dual channel U-Net to jointly process the real and imaginary components of undersampled k-space, restoring missing frequency content, and incorporates an ensemble strategy to generate uncertainty maps. Experiments on low-field brain MRI demonstrate that our k-space-driven image quality enhancement outperforms the counterpart spatial-domain and other state-of-the-art baselines, achieving image quality comparable to full high-field k-space acquisitions using OOD data. To the best of our knowledge, this work is among the first to combine low-field MR image reconstruction, quality enhancement using undersampled k-space, and uncertainty quantification within a unified framework.
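
As a rough illustration of the input such a dual channel network might receive, the sketch below undersamples the k-space of a toy image and stacks the real and imaginary parts as two channels. The sampling pattern (keeping a central band of rows), the function names, and the test image are all assumptions; the abstract does not specify them.

```python
import numpy as np

def undersample_kspace(image, keep_fraction=0.5):
    """Return (two-channel k-space, mask), keeping a central band of rows."""
    kspace = np.fft.fftshift(np.fft.fft2(image))  # low frequencies at the centre
    rows = image.shape[0]
    keep = max(1, int(round(rows * keep_fraction)))
    start = (rows - keep) // 2
    mask = np.zeros(image.shape, dtype=bool)
    mask[start:start + keep, :] = True
    kspace_us = np.where(mask, kspace, 0)  # zero-fill the unsampled rows
    # Channel 0 holds the real part, channel 1 the imaginary part.
    channels = np.stack([kspace_us.real, kspace_us.imag], axis=0)
    return channels, mask

def channels_to_image(channels):
    """Recombine the two channels and return the magnitude image."""
    kspace = channels[0] + 1j * channels[1]
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))

image = np.outer(np.hanning(32), np.hanning(32))  # smooth, non-negative toy image
full, _ = undersample_kspace(image, keep_fraction=1.0)
half, mask = undersample_kspace(image, keep_fraction=0.5)
zero_filled = channels_to_image(half)
```

With every row kept, the round trip reproduces the toy image. With half the rows kept, `zero_filled` is the degraded reconstruction that a learned model would be asked to improve.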

246. 【2603.14117】Improving Visual Reasoning with Iterative Evidence Refinement

Link: https://arxiv.org/abs/2603.14117

Authors: Zeru Shi, Kai Mei, Yihao Quan, Dimitris N. Metaxas, Ruixiang Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision language models, Vision language, requires re-grounding intermediate, re-grounding intermediate steps, increasingly capable

Comments

Click to view abstract

Abstract:Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional grounding is needed, enabling later steps to condition on relevant visual cues without external tool calls or re-encoding. We use reinforcement learning to teach the model when to trigger visual revisiting and which region embeddings to retrieve and insert during the reasoning process. Experiments on multiple visual reasoning benchmarks, together with perception, reasoning, and hallucination evaluations, show that SIEVE yields consistent gains, improving performance by 8 percent on average across several benchmarks.

247. 【2603.14112】Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

Link: https://arxiv.org/abs/2603.14112

Authors: Dan Wang, Haiyan Sun, Shan Du, Z. Jane Wang, Zhaochong An, Serge Belongie, Xinrui Cui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high resolution images, reconstruct high resolution, Image super-resolution, resolution images, high perceptual quality

Comments

Click to view abstract

Abstract:Image super-resolution (SR) aims to reconstruct high resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle with realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance with a multi-encoder design and semantic degradation constraints unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing both realistic and faithful restorations.

248. 【2603.14086】Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining

Link: https://arxiv.org/abs/2603.14086

Authors: Eytan Kats, Mattias P. Heinrich

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurate longitudinal assessment, enabling accurate longitudinal, multi-modal data fusion, clinical imaging workflows, enabling accurate

Comments: Accepted for International Symposium on Biomedical Imaging 2026 (ISBI 2026)

Click to view abstract

Abstract: Medical image registration is a critical component of clinical imaging workflows, enabling accurate longitudinal assessment, multi-modal data fusion, and image-guided interventions. Intensity-based approaches often struggle with interscanner variability and complex anatomical deformations, whereas feature-based methods offer improved robustness by leveraging semantically informed representations. In this work, we investigate DINO-style self-supervised pretraining directly on 3D medical imaging data, aiming to learn dense volumetric features well suited for deformable registration. We assess the resulting representations on a challenging interpatient abdominal registration task across both MRI and CT modalities. Our domain-specialized pretraining outperforms the DINOv2 model trained on a large-scale collection of natural images, while requiring substantially lower computational resources at inference time. Moreover, it surpasses established registration models under out-of-domain evaluation, demonstrating the value of task-agnostic yet medical imaging-focused pretraining for robust and efficient 3D image registration.

249. 【2603.14077】Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling

Link: https://arxiv.org/abs/2603.14077

Authors: Viet Dung Nguyen, Mobina Ghorbaninejad, Chengyi Ma, Reynold Bailey, Gabriel J. Diaz, Alexander Fix, Ryan J. Suess, Alexander Ororbia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low energy consumption, offering great utility, eye tracking pipelines, real-world eye tracking, Eye feature extraction

Comments: 7 pages, 3 figures, 1 table, accepted to ETRA 2026

Click to view abstract

Abstract: Eye feature extraction from event-based data streams can be performed efficiently and with low energy consumption, offering great utility to real-world eye tracking pipelines. However, few eye feature extractors are designed to handle sudden changes in event density caused by the changes between gaze behaviors that vary in their kinematics, leading to degraded prediction performance. In this work, we address this problem by introducing the adaptive inference state space model (AISSM), a novel architecture for feature extraction that is capable of dynamically adjusting the relative weight placed on current versus recent information. This relative weighting is determined via estimates of the signal-to-noise ratio and event density produced by a complementary dynamic confidence network. Lastly, we craft and evaluate a novel learning technique that improves training efficiency. Experimental results demonstrate that the AISSM system outperforms state-of-the-art models for event-based eye feature extraction.

250. 【2603.14076】SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement

Link: https://arxiv.org/abs/2603.14076

Authors: Yiran Guo, Simone Mentasti, Xiaofeng Jin, Matteo Frosi, Matteo Matteucci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: perceive dense scene, dense scene geometry, monocular video streams, enabling agents, video streams

Comments: main paper: 20 pages, 6 figures; appendix: 15 pages, 5 figures

Click to view abstract

Abstract: 3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes "feature bleeding" at object boundaries, and the "cold start" instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69% respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.

251. 【2603.14074】Self-Supervised Uncertainty Estimation For Super-Resolution of Satellite Images

Link: https://arxiv.org/abs/2603.14074

Authors: Zhe Zheng, Valéry Dewil, Pablo Arias

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: paired low, satellite imagery, imagery is challenging, challenging due, high-resolution data

Comments: Conference submission

Click to view abstract

Abstract: Super-resolution (SR) of satellite imagery is challenging due to the lack of paired low-/high-resolution data. Recent self-supervised SR methods overcome this limitation by exploiting the temporal redundancy in burst observations, but they lack a mechanism to quantify uncertainty in the reconstruction. In this work, we introduce a novel self-supervised loss that makes it possible to estimate uncertainty in image super-resolution without ever accessing the ground-truth high-resolution data. We adopt a decision-theoretic perspective and show that minimizing the corresponding Bayesian risk yields the posterior mean and variance as optimal estimators. We validate our approach on a synthetic SkySat L1B dataset and demonstrate that it produces calibrated uncertainty estimates comparable to supervised methods. Our work bridges self-supervised restoration with uncertainty quantification, providing a practical framework for uncertainty-aware image reconstruction.
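
The abstract does not give the form of the proposed loss. As a familiar stand-in for the decision-theoretic claim, the sketch below uses the Gaussian negative log-likelihood, whose minimizer over a predicted mean and variance is the mean and variance of the targets. The sample values are hypothetical and the example covers only the simplest unconditional case.

```python
import math

def gaussian_nll(samples, mu, var):
    """Gaussian negative log-likelihood, up to a factor of 1/2 and a constant."""
    return sum((y - mu) ** 2 / var + math.log(var) for y in samples) / len(samples)

# Hypothetical repeated observations of a single pixel.
samples = [1.0, 2.0, 4.0, 5.0]
mean = sum(samples) / len(samples)
variance = sum((y - mean) ** 2 for y in samples) / len(samples)

# The loss at (mean, variance) is lower than at nearby alternatives.
best = gaussian_nll(samples, mean, variance)
shifted_mean = gaussian_nll(samples, mean + 0.5, variance)
larger_var = gaussian_nll(samples, mean, 1.5 * variance)
smaller_var = gaussian_nll(samples, mean, 0.5 * variance)
```

A network trained with a risk of this kind is pushed toward predicting the conditional mean as its reconstruction and the conditional variance as its uncertainty map.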

252. 【2603.14073】MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation

Link: https://arxiv.org/abs/2603.14073

Authors: Byungjun Kim, Soobin Um, Jong Chul Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: generating high-fidelity, significant challenge, recent advances, remains a significant, dynamic motion remains

Comments

Click to view abstract

Abstract:Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g. "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity; a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
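
The guidance idea can be sketched with a toy denoiser. The code assumes the usual classifier-free guidance extrapolation, with the negative branch computed from a Gaussian-perturbed copy of the concept embedding and guidance switched off after an early fraction of the steps. The denoiser, the parameter values, and the behaviour outside the guidance window are hypothetical; the actual MotionCFG formulation may differ.

```python
import random

def toy_denoiser(latent, embedding):
    """Stand-in for a T2V denoiser: any function of (latent, text embedding)."""
    return [0.9 * z + 0.1 * e for z, e in zip(latent, embedding)]

def perturb(embedding, sigma, rng):
    """Inject Gaussian noise into the concept embedding."""
    return [e + rng.gauss(0.0, sigma) for e in embedding]

def guided_prediction(latent, concept, step, num_steps, scale=3.0,
                      sigma=0.5, early_fraction=0.3, rng=None):
    """CFG-style extrapolation away from a noise-perturbed negative anchor."""
    rng = rng or random.Random(0)
    cond = toy_denoiser(latent, concept)
    # Piecewise schedule: only the early denoising steps are guided.
    if step >= int(early_fraction * num_steps):
        return cond
    negative = toy_denoiser(latent, perturb(concept, sigma, rng))
    return [n + scale * (c - n) for c, n in zip(cond, negative)]

latent = [0.0, 1.0]
concept = [1.0, -1.0]
early = guided_prediction(latent, concept, step=0, num_steps=50)
late = guided_prediction(latent, concept, step=40, num_steps=50)
```

Because the negative anchor is a small perturbation of the same concept rather than a separate text prompt, the extrapolation direction stays local to that concept, which is the intuition the abstract gives for avoiding semantic drift.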

253. 【2603.14062】MPDiff: Temporal Mixed-Precision for Diffusion Models

Link: https://arxiv.org/abs/2603.14062

Authors: Basile Lewandowski, Simon Kurz, Aditya Shankar, Robert Birke, Jian-Jia Chen, Lydia Y. Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: iterative denoising processes, high inference latency, go-to method, processes has high, denoising timesteps

Comments

Click to view abstract

Abstract: Diffusion models are the go-to method for Text-to-Image generation, but their iterative denoising process has high inference latency. Quantization reduces compute time by using lower bitwidths, but applies a fixed precision across all denoising timesteps, leaving an entire optimization axis unexplored. We propose TMPDiff, a temporal mixed-precision framework for diffusion models that assigns different numeric precision to different denoising timesteps. We hypothesize that quantization errors accumulate additively across timesteps, which we then validate experimentally. Based on our observations, we develop an adaptive bisectioning-based algorithm, which assigns per-step precisions with linear evaluation complexity, reducing an otherwise exponential search problem. Across four state-of-the-art diffusion models and three datasets, TMPDiff consistently outperforms uniform-precision baselines at matched speedup, achieving 10 to 20% improvement in perceptual quality. On FLUX.1-dev, TMPDiff achieves 90% SSIM relative to the full-precision model at a speedup of 2.5x over 16-bit inference.

254. 【2603.14052】A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Link: https://arxiv.org/abs/2603.14052

Authors: Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Keywords: paper presents, perception-action exploration, exploration, perception-action, multi-round perception-action exploration

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answering (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at this https URL.

255. 【2603.14039】EyeWorld: A Generative World Model of Ocular State and Dynamics

Link: https://arxiv.org/abs/2603.14039

Authors: Ziyu Gao, Xinyuan Wu, Xiaolan Chen, Zhuoran Liu, Ruoyu Chen, Bowen Liu, Bingjie Yan, Zhenhan Wang, Kai Jin, Jiancheng Yang, Yih Chung Tham, Mingguang He, Danli Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ophthalmic decision-making depends, subtle lesion-scale cues, lesion-scale cues interpreted, medical foundation models, foundation models remain

Comments: 38 pages, 8 figures

Click to view abstract

Abstract:Ophthalmic decision-making depends on subtle lesion-scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation-stable latent ocular state shared across modalities, unifying fine-grained parsing, structure-preserving cross-modality translation and quality-robust enhancement within a single framework. Longitudinal supervision further enables time-conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis-oriented simulation in medicine.

256. 【2603.14031】Intrinsic Tolerance in C-Arm Imaging: How Extrinsic Re-optimization Preserves 3D Reconstruction Accuracy

Link: https://arxiv.org/abs/2603.14031

Authors: Lin Li, Benjamin Aubert, Paul Kemper, Aric Plumley

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: C-arm fluoroscopy, relies on accurate, Purpose, pixels

Comments

Click to view abstract

Abstract: Purpose: C-arm fluoroscopy's 3D reconstruction relies on accurate intrinsic calibration, which is often challenging in clinical practice. This study ensures high-precision reconstruction accuracy by re-optimizing the extrinsic parameters to compensate for intrinsic calibration errors. Methods: We conducted both simulation and real-world experiments using five commercial C-arm systems. Intrinsic parameters were perturbed in controlled increments. Focal length was increased by 100 to 700 pixels (≈20 mm to 140 mm) and principal point by 20 to 200 pixels. For each perturbation, we (1) reconstructed 3D points from known phantom geometries, (2) re-estimated extrinsic poses using standard optimization, and (3) measured reconstruction and reprojection errors relative to ground truth. Results: Even with focal length errors up to 500 pixels (≈100 mm, assuming a nominal focal length of ~1000 mm), mean 3D reconstruction error remained under 0.2 mm. Larger focal length deviations (700 pixels) elevated error to only ≈0.3 mm. Principal point shifts up to 200 pixels introduced negligible reconstruction error once extrinsic parameters were re-optimized, with reprojection error increases below 0.5 pixels. Conclusion: Moderate errors in intrinsic calibration can be effectively mitigated by extrinsic re-optimization, preserving submillimeter 3D reconstruction accuracy. This intrinsic tolerance suggests a practical pathway to relax calibration precision requirements, thereby simplifying C-arm system setup and reducing clinical workflow burden without compromising performance.
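
The compensation effect can be seen in a single-view pinhole toy model. The sketch assumes a focal length of 5000 pixels (about 1000 mm at the 0.2 mm per pixel implied by the abstract), a hypothetical phantom of ±50 mm placed 800 mm from the source, and a closed-form depth shift standing in for the pose re-estimation. It is not the authors' optimization and it measures reprojection error only.

```python
def project(points, focal_px, depth_offset_mm=0.0):
    """Pinhole projection (principal point at 0) of camera-frame points in mm."""
    return [(focal_px * x / (z + depth_offset_mm),
             focal_px * y / (z + depth_offset_mm)) for x, y, z in points]

def max_error(a, b):
    """Largest pixel distance between corresponding projections."""
    return max(((ua - ub) ** 2 + (va - vb) ** 2) ** 0.5
               for (ua, va), (ub, vb) in zip(a, b))

# Hypothetical phantom: a 3x3x3 grid spanning +/-50 mm around a point at 800 mm.
phantom = [(x, y, 800.0 + dz)
           for x in (-50.0, 0.0, 50.0)
           for y in (-50.0, 0.0, 50.0)
           for dz in (-50.0, 0.0, 50.0)]

f_true, f_wrong = 5000.0, 5500.0  # pixels; a 500-pixel focal length error
reference = project(phantom, f_true)

# Wrong focal length with the original pose.
err_fixed_pose = max_error(project(phantom, f_wrong), reference)

# Wrong focal length with the phantom pushed back along the optical axis so
# that the magnification at its centre is unchanged.
shift = 800.0 * (f_wrong / f_true - 1.0)
err_reoptimized = max_error(project(phantom, f_wrong, shift), reference)
```

In this toy setup the depth shift absorbs most of the focal length error, leaving a residual that comes only from the phantom's finite depth extent and is more than ten times smaller than the error at the original pose.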

257. 【2603.14023】High-speed Imaging through Turbulence with Event-based Light Fields

Link: https://arxiv.org/abs/2603.14023

Authors: Yu-Hsiang Huang, Levi Burner, Sachin Shah, Ziyuan Qu, Adithya Pediredla, Christopher A. Metzler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast-moving extended non-rigid, high frame rate, imaging fast-moving extended, extended non-rigid objects, fast-moving extended

Comments

Click to view abstract

Abstract: This work introduces and demonstrates the first system capable of imaging fast-moving extended non-rigid objects through strong atmospheric turbulence at high frame rate. Event cameras are a novel sensing architecture capable of estimating high-speed imagery at thousands of frames per second. However, on their own, event cameras are unable to disambiguate scene motion from turbulence. In this work, we overcome this limitation using event-based light field cameras: by simultaneously capturing multiple views of a scene, event-based light field cameras and machine learning-based reconstruction algorithms are able to disambiguate motion-induced dynamics, which produce events that are strongly correlated across views, from turbulence-induced dynamics, which produce events that are weakly correlated across views. Tabletop experiments demonstrate that event-based light fields can overcome strong turbulence while imaging high-speed objects traveling at up to 16,000 pixels per second.

258. 【2603.14022】A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations

Link: https://arxiv.org/abs/2603.14022

Authors: Neelu Madan, Àlex Pujol, Andreas Møgelmose, Sergio Escalera, Kamal Nasrollahi, Graham W. Taylor, Thomas B. Moeslund

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unsupervised object-centric learning, decomposing visual scenes, Euclidean space, vector representations called, compact vector representations

Comments

Click to view abstract

Abstract: Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called slots, each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline to project Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space, without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), we find that hyperbolic projection exposes a consistent scene-level to object-level organisation, where coarse slots occupy greater manifold depth than fine slots, which is absent in Euclidean space. We further identify a "curvature-task tradeoff": low curvature (c = 0.2) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature (c = 0.5) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step. Code and models are available at this https URL.
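
One common way to place a Euclidean vector on the Lorentz hyperboloid is the exponential map at the origin. The abstract does not say which projection the authors use, so the sketch below is an assumption, and the slot values are hypothetical. For curvature parameter c, the hyperboloid is the set of points with Lorentz inner product -1/c with themselves.

```python
import math

def exp_map_origin(v, c):
    """Map a Euclidean vector v onto the hyperboloid -x0^2 + |xs|^2 = -1/c."""
    norm = math.sqrt(sum(x * x for x in v))
    sqrt_c = math.sqrt(c)
    if norm == 0.0:
        return [1.0 / sqrt_c] + [0.0] * len(v)
    time = math.cosh(sqrt_c * norm) / sqrt_c           # time-like coordinate x0
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)  # rescales the spatial part
    return [time] + [scale * x for x in v]

def lorentz_inner(a, b):
    """Lorentz inner product: minus the time term plus the spatial dot product."""
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1:], b[1:]))

def distance_to_origin(point, c):
    """Geodesic distance from the hyperboloid origin (1/sqrt(c), 0, ..., 0)."""
    origin = [1.0 / math.sqrt(c)] + [0.0] * (len(point) - 1)
    return math.acosh(max(1.0, -c * lorentz_inner(origin, point))) / math.sqrt(c)

slot = [0.6, -0.8, 1.2]  # hypothetical Euclidean slot embedding
point = exp_map_origin(slot, c=0.5)
depth = distance_to_origin(point, c=0.5)
```

Under this map the geodesic distance from the origin equals the Euclidean norm of the input vector, so a quantity such as `depth` is one possible reading of the "manifold depth" that the abstract compares between coarse and fine slots.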

259. 【2603.14021】EI-Part: Explode for Completion and Implode for Refinement

Link: https://arxiv.org/abs/2603.14021

Authors: Wanhu Sun, Zhongjin Luo, Heliang Zheng, Jiahao Chang, Chongjie Ye, Huiang He, Shengchu Zhao, Rongfei Jia, Xiaoguang Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: including gaming, film production, downstream applications, industrial design, structural coherence

Comments

Click to view abstract

Abstract:Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components, characterized by strong structural coherence, geometric plausibility, geometric fidelity, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy fully leverages spatial resolution, enabling flexible part completion and fine geometric detail generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments on multiple benchmarks demonstrate that EI-Part efficiently produces semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level 3D generation. Project page: this https URL

Submission history: v1 submitted by Wanhu Sun on Sat, 14 Mar 2026 16:49:37 UTC (43,348 KB)

260. 【2603.14012】Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

Link: https://arxiv.org/abs/2603.14012

Authors: Jiachen Li, Xiaojin Gong, Dongping Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalized person Re-identification, Domain Generalized person, unseen target domains, Domain Generalized, person Re-identification

Comments: none

Abstract:Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, their performance can still be improved. Recently, Vision-Language Models (VLMs) have shown outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID yields limited generalization improvement, because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at this https URL.

261. 【2603.14005】Towards Generalizable Deepfake Detection via Real Distribution Bias Correction

Link: https://arxiv.org/abs/2603.14005

Authors: Ming-Hui Liu, Harry Cheng, Xin Luo, Xin-Shun Xu, Mohan S. Kankanhalli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing methods attempt, dynamically evolving forgery, evolving forgery types, generalize deepfake detectors, future unseen forgeries

Comments: First Version

Abstract:To generalize deepfake detectors to future unseen forgeries, most existing methods attempt to simulate the dynamically evolving forgery types using available source domain data. However, predicting an unbounded set of future manipulations from limited prior examples is infeasible. To overcome this limitation, we propose to exploit the invariance of real data from two complementary perspectives: the fixed population distribution of the entire real class and the inherent Gaussianity of individual real images. Building on these properties, we introduce the Real Distribution Bias Correction (RDBC) framework, which consists of two key components: the Real Population Distribution Estimation module and the Distribution-Sampled Feature Whitening module. The former utilizes the independent and identically distributed (i.i.d.) property of real samples to derive the normal distribution form of their statistics, from which the distribution parameters can be estimated using limited source domain data. Based on the learned population distribution, the latter utilizes the inherent Gaussianity of real data as a discriminative prior and performs a sampling-based whitening operation to amplify the Gaussianity gap between real and fake samples. Through synergistic coupling of the two modules, our model captures the real-world properties of real samples, thereby enhancing its generalizability to unseen target domains. Extensive experiments demonstrate that RDBC achieves state-of-the-art performance in both in-domain and cross-domain deepfake detection.

262. 【2603.14004】U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning

Link: https://arxiv.org/abs/2603.14004

Authors: Bo Liu, Xuan Cui, Run Zeng, Wei Duan, Chongwen Liu, Jinrui Qian, Lianggui Tang, Hongping Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: virtual avatar creation, human-computer interaction systems, interaction systems due, facial attribute editing, Latent space-based facial

Comments: none

Abstract:Latent space-based facial attribute editing methods have gained popularity in applications such as digital entertainment, virtual avatar creation, and human-computer interaction systems due to their potential for efficient and flexible attribute manipulation, particularly for continuous edits. Among these, unsupervised latent space-based methods, which discover effective semantic vectors without relying on labeled data, have attracted considerable attention in the research community. However, existing methods still encounter difficulties in disentanglement, as manipulating a specific facial attribute may unintentionally affect other attributes, complicating fine-grained controllability. To address these challenges, we propose a novel framework designed to offer an effective and adaptable solution for unsupervised facial attribute editing, called Unsupervised Facial Attribute Controllable Editing (U-Face). The proposed method frames semantic vector learning as a subspace learning problem, where latent vectors are approximated within a lower-dimensional semantic subspace spanned by a semantic vector matrix. This formulation can also be equivalently interpreted from a projection-reconstruction perspective and further generalized into an autoencoder framework, providing a foundation that can support disentangled representation learning in a flexible manner. To improve disentanglement and controllability, we impose orthogonal non-negative constraints on the semantic vectors and incorporate attribute boundary vectors to reduce entanglement in the learned directions. Although these constraints make the optimization problem challenging, we design an alternating iterative algorithm, called Alternating Iterative Disentanglement and Controllability (AIDC), with closed-form updates and provable convergence under specific conditions.

263. 【2603.14001】PhyGaP: Physically-Grounded Gaussians with Polarization Cues

Link: https://arxiv.org/abs/2603.14001

Authors: Jiale Wu, Xiaoyang Bai, Zongqi He, Weiwei Xu, Yifan Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, demonstrated great success, Recent advances, demonstrated great, great success

Comments: The paper is accepted by CVPR 2026

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability. Our code will be released soon.

264. 【2603.13994】Human-like Object Grouping in Self-supervised Vision Transformers

Link: https://arxiv.org/abs/2603.13994

Authors: Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Keywords: objectives achieve strong, exhibit emergent object, achieve strong performance, achieve strong, tasks and exhibit

Comments: none

Abstract:Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
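
The abstract mentions two ingredients that are easy to picture in code: a Gram matrix over image patches and a measure of patch similarity within versus between objects. The sketch below is an illustration only. The function names, the cosine normalisation, and the "within minus between" score are assumptions, not the authors' metric, whose exact definition is not given here.

```python
import math

def cosine_gram(patches):
    """Gram matrix of L2-normalised patch features (cosine similarities)."""
    normed = []
    for p in patches:
        n = math.sqrt(sum(x * x for x in p)) or 1.0  # guard against zero vectors
        normed.append([x / n for x in p])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def object_centric_score(patches, labels):
    """Mean similarity of same-object patch pairs minus that of cross-object pairs.

    Hypothetical score: higher values mean patches cluster by object.
    """
    gram = cosine_gram(patches)
    within, between = [], []
    for i in range(len(patches)):
        for j in range(i + 1, len(patches)):
            (within if labels[i] == labels[j] else between).append(gram[i][j])
    if not within or not between:
        raise ValueError("need at least one within-object and one between-object pair")
    return sum(within) / len(within) - sum(between) / len(between)

# Toy example: four 2-D patch features (made-up values) belonging to two objects.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
object_ids = [0, 0, 1, 1]
print(object_centric_score(features, object_ids))
```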

265. 【2603.13993】VAD4Space: Visual Anomaly Detection for Planetary Surface Imagery

Link: https://arxiv.org/abs/2603.13993

Authors: Fabrizio Genilotti, Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Space missions generate, missions generate massive, generate massive volumes, Space missions, manual inspection

Comments: none

Abstract:Space missions generate massive volumes of high-resolution orbital and surface imagery that far exceed the capacity for manual inspection. Detecting rare phenomena is scientifically critical, yet traditional supervised learning struggles due to scarce labeled examples and closed-world assumptions that prevent discovery of genuinely novel observations. In this work, we investigate Visual Anomaly Detection (VAD) as a framework for automated discovery in planetary exploration. We present the first empirical evaluation of state-of-the-art feature-based VAD methods on real planetary imagery, encompassing both orbital lunar data and Mars rover surface imagery. To support this evaluation, we introduce two benchmarks: (i) a lunar dataset derived from Lunar Reconnaissance Orbiter Camera Narrow Angle imagery, comprising fresh and degraded craters as anomalies alongside normal terrain; and (ii) a Mars surface dataset designed to reflect the characteristics of rover-acquired imagery. We evaluate multiple VAD approaches with a focus on computationally efficient, edge-oriented solutions suitable for onboard deployment, applicable to both orbital platforms surveying the lunar surface and surface rovers operating on Mars. Our results demonstrate that feature-based VAD methods can effectively identify rare planetary surface phenomena while remaining feasible for resource-constrained environments. By grounding anomaly detection in planetary science, this work establishes practical benchmarks and highlights the potential of open-world perception systems to support a range of mission-critical applications, including tactical planning, landing site selection, hazard detection, bandwidth-aware data prioritization, and the discovery of unanticipated geological processes.

266. 【2603.13978】When Visual Privacy Protection Meets Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.13978

Authors: Xiaofei Hui, Qian Wu, Haoxuan Qu, Majid Mirmehdi, Hossein Rahmani, Jun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, raised great concerns

Comments: none

Abstract:The emergence of Multimodal Large Language Models (MLLMs) and the widespread usage of MLLM cloud services such as GPT-4V have raised great concerns about privacy leakage in visual data. As these models are typically deployed in cloud services, users are required to submit their images and videos, posing serious privacy risks. However, how to tackle such privacy concerns is an under-explored problem. Thus, in this paper, we aim to conduct a new investigation to protect visual privacy while enjoying the convenience brought by MLLM services. We address the practical case where the MLLM is a "black box", i.e., we only have access to its input and output without knowing its internal model information. To tackle such a challenging yet demanding problem, we propose a novel framework, in which we carefully design the learning objective with Pareto optimality to seek a better trade-off between visual privacy and MLLM's performance, and propose critical-history enhanced optimization to effectively optimize the framework with the black-box MLLM. Our experiments show that our method is effective on different benchmarks.

267. 【2603.13969】Leveraging a Statistical Shape Model for Efficient Generation of Annotated Training Data: A Case Study on Liver Landmarks Segmentation

Link: https://arxiv.org/abs/2603.13969

Authors: Denis Krnjaca, Lorena Krames, Werner Nahm

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: critical initial step, robust multimodal registration, Anatomical landmark segmentation, landmark segmentation serves, computer-assisted interventions

Comments: none

Abstract:Anatomical landmark segmentation serves as a critical initial step for robust multimodal registration during computer-assisted interventions. Current approaches predominantly rely on deep learning, which often necessitates the extensive manual generation of annotated datasets. In this paper, we present a novel strategy for creating large annotated datasets using a statistical shape model (SSM) based on a mean shape that is manually labeled only once. We demonstrate the method's efficacy through its application to deep-learning-based anatomical landmark segmentation, specifically targeting the detection of the anterior ridge and the falciform ligament in 3D liver shapes. A specialized deep learning network was trained with 8,800 annotated liver shapes generated by the SSM. The network's performance was evaluated on 500 unseen synthetic SSM shapes, yielding a mean Intersection over Union of 91.4% (87.4% for the anterior ridge and 87.6% for the falciform ligament). Subsequently, the network was applied to clinical patient liver shapes, with qualitative evaluation indicating promising results and highlighting the generalizability of the proposed approach. Our findings suggest that the SSM-based data generation approach alleviates the labor-intensive process of manual labeling while enabling the creation of large annotated training datasets for machine learning. Although our study focuses on liver anatomy, the proposed methodology holds potential for a broad range of applications where annotated training datasets play a pivotal role in developing accurate deep-learning models.
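
The reported metric, mean Intersection over Union, can be sketched as follows. This is a generic per-class IoU over flat label lists. The class ids and toy labels are made up for illustration, and the abstract does not state which classes enter the authors' mean.

```python
def class_iou(pred, target, cls):
    """IoU for one class: |pred AND target| / |pred OR target|; None if class absent."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None

def mean_iou(pred, target, classes):
    """Average the per-class IoU over the classes that occur in pred or target."""
    scores = [class_iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("none of the requested classes occur in the data")
    return sum(scores) / len(scores)

# Toy per-vertex labels (hypothetical ids: 0 = background, 1 and 2 = landmarks).
predicted = [0, 1, 1, 2, 2, 0]
reference = [0, 1, 2, 2, 2, 0]
print(mean_iou(predicted, reference, classes=[0, 1, 2]))
```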

268. 【2603.13964】VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

Link: https://arxiv.org/abs/2603.13964

Authors: Hiroto Nakata, Yawen Zou, Shunsuke Sakai, Shun Maeda, Chunzhi Gu, Yijin Wei, Shangce Gao, Chao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: industrial inspection remains, inspection remains challenging, remains challenging due, distract vision-centric detectors, identifying rule-level violations

Comments: none

Abstract:Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: this https URL.

269. 【2603.13961】USIS-PGM: Photometric Gaussian Mixtures for Underwater Salient Instance Segmentation

Link: https://arxiv.org/abs/2603.13961

Authors: Lin Hong, Xiangtong Yao, Mürüvvet Bozkurt, Xin Wang, Fumin Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: marine robotic systems, visual scene understanding, salient object detection, underwater salient object, robotic systems

Comments: none

Abstract:Underwater salient instance segmentation (USIS) is crucial for marine robotic systems, as it enables both underwater salient object detection and instance-level mask prediction for visual scene understanding. Compared with its terrestrial counterpart, USIS is more challenging due to the underwater image degradation. To address this issue, this paper proposes USIS-PGM, a single-stage framework for USIS. Specifically, the encoder enhances boundary cues through a frequency-aware module and performs content-adaptive feature reweighting via a dynamic weighting module. The decoder incorporates a Transformer-based instance activation module to better distinguish salient instances. In addition, USIS-PGM employs multi-scale Gaussian heatmaps generated from ground-truth masks through Photometric Gaussian Mixture (PGM) to supervise intermediate decoder features, thereby improving salient instance localization and producing more structurally coherent mask predictions. Experimental results demonstrate the superiority and practical applicability of the proposed USIS-PGM model.

270. 【2603.13960】IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation

Link: https://arxiv.org/abs/2603.13960

Authors: Chenru Wang, Yunyi Chen, Zijun Yang, Joey Tianyi Zhou, Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern deep learning, increasing computational demands, Dataset Distillation aims, synthesize compact datasets, large-scale real datasets

Comments: CVPR26 Accepted

Abstract:Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling ($S^3$) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.

271. 【2603.13951】DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction

Link: https://arxiv.org/abs/2603.13951

Authors: Jing Wang, Huimin Shi, Quan Zhou, Qibo Liu, Suofei Zhang, Huimin Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: insufficient cross-modal communications, significant computational costs, visual-language foundation models, fundamental challenges, insufficient cross-modal

Comments: 13 pages, 7 figures

Abstract:Recent years have witnessed the remarkable development of open-vocabulary semantic segmentation (OVSS) using visual-language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from textual guidance into the visual representations and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.

272. 【2603.13943】Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing

Link: https://arxiv.org/abs/2603.13943

Authors: Kursat Komurcu, Linas Petkevicius

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Predicting satellite imagery, satellite imagery requires, Predicting satellite, textural detail, satellite imagery

Comments: ICLR 2026 Workshop ML4RS Main Track: [this https URL](https://openreview.net/forum?id=WBHfQLbgZR)

Abstract:Predicting satellite imagery requires a balance between structural accuracy and textural detail. Standard deterministic methods like PredRNN or SimVP minimize pixel-based errors but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geographic-spatial features. Generative models provide realistic textures but often misleadingly reveal structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines Self-Supervised Learning (SSL) with Latent Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then route a frozen Stable Diffusion backbone via a lightweight cross-attention adapter. This ensures that the synthesized high-accuracy textures are based on absolutely accurate structural predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite standard autoregressive stability limits. The code and dataset are publicly available on this https URL.

273. 【2603.13941】Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting

Link: https://arxiv.org/abs/2603.13941

Authors: Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier, Thomas Längle, Markus Klute

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Growing waste streams, circular economy require, economy require efficient, require efficient automated, efficient automated waste

Comments: Submitted to Information Fusion (Elsevier). 23 pages, 10 figures, 7 tables

Abstract:Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).

274. 【2603.13928】Discriminative Flow Matching Via Local Generative Predictors

Link: https://arxiv.org/abs/2603.13928

Authors: Om Govind Jha, Manoj Bamniya, Ayon Borthakur

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Traditional discriminative computer, mapping input features, single computational step, computer vision relies, vision relies predominantly

Comments: none

Abstract:Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold -- such as class embeddings or bounding box coordinates -- we are at the interface between generative and discriminative learning. Our method attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.
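
The idea of transporting noise samples toward a target embedding with a learned vector field can be sketched with the common straight-line flow matching objective. The code below assumes a linear interpolation path and a Gaussian noise source. `predict_velocity` is a hypothetical stand-in for a trained predictor, and the block-wise local objectives described in the abstract are not reproduced.

```python
import random

def flow_matching_loss(predict_velocity, x1, rng):
    """Squared error between predicted and target velocity at a random time t.

    Assumption: straight path x_t = (1 - t) * x0 + t * x1, target velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]           # noise sample
    t = rng.random()                                  # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)

def integrate(predict_velocity, x0, steps=10):
    """Euler integration of the vector field from t = 0 to t = 1 (inference)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Oracle field for one fixed target (e.g. a class embedding with made-up values):
# on the straight path, (x1 - x_t) / (1 - t) equals x1 - x0.
class_embedding = [1.0, -2.0, 0.5]
oracle = lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, class_embedding)]

rng = random.Random(0)
print(flow_matching_loss(oracle, class_embedding, rng))
print(integrate(oracle, [0.0, 0.0, 0.0]))
```

In a classifier built this way, the integrated endpoint would be compared against the class embeddings, and the nearest one would give the prediction.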

275. 【2603.13919】OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction

Link: https://arxiv.org/abs/2603.13919

Authors: Xianke Wu, Songlin Bai, Chengxiang Li, Zhiyao Luo, Yulin Tian, Fenghua Zhu, Yisheng Lv, Yonglin Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: collaboration extends sensing, extends sensing ranges, reliability remains severely, remains severely constrained, perception blind spots

Comments: none

Abstract:While Vehicle-to-Vehicle (V2V) collaboration extends sensing ranges through multi-agent data sharing, its reliability remains severely constrained by ground-level occlusions and the limited perspective of chassis-mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD-Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground-air domain gaps and data sparsity, we adopt a transfer learning strategy to fine-tune UAV weights from pre-trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground-air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross-Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V-Air benchmark to validate the transition from V2V to Vehicle-to-Vehicle-to-UAV. Compared to state-of-the-art methods, our approach improves 2D and 3D AP@0.7 by 4% and 7%, respectively.

276. 【2603.13917】Evaluation of Visual Place Recognition Methods for Image Pair Retrieval in 3D Vision and Robotics

Link: https://arxiv.org/abs/2603.13917

Authors: Dennis Haitz, Athradi Shritish Shetty, Michael Weinmann, Markus Ulrich

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Place Recognition, Visual Place, Place Recognition, image pair retrieval, image retrieval task

Comments: Accepted at the XXV ISPRS Congress 2026; to appear in the ISPRS Annals

Abstract:Visual Place Recognition (VPR) is a core component in computer vision, typically formulated as an image retrieval task for localization, mapping, and navigation. In this work, we instead study VPR as an image pair retrieval front-end for registration pipelines, where the goal is to find top-matching image pairs between two disjoint image sets for downstream tasks such as scene registration, SLAM, and Structure-from-Motion. We comparatively evaluate state-of-the-art VPR families - NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) - on three challenging datasets: object-centric outdoor scenes (Tanks and Temples), indoor RGB-D scans (ScanNet-GS), and autonomous-driving sequences (KITTI). We show that modern global descriptor approaches are increasingly suitable as off-the-shelf image pair retrieval modules in challenging scenarios including perceptual aliasing and incomplete sequences, while exhibiting clear, domain-dependent strengths and weaknesses that are critical when choosing VPR components for robust mapping and registration.
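
Image pair retrieval between two disjoint image sets, as framed in this abstract, reduces to ranking cross-set descriptor similarities. The following is a minimal sketch that assumes each image already has a global descriptor from some VPR model. The function names and toy descriptors are hypothetical, and cosine similarity is an assumed scoring choice.

```python
import math

def _normalize(vec):
    """Scale a descriptor to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / n for x in vec]

def top_k_pairs(desc_a, desc_b, k=3):
    """Return the k best (index_a, index_b, cosine_similarity) pairs across two sets."""
    a = [_normalize(d) for d in desc_a]
    b = [_normalize(d) for d in desc_b]
    scored = [
        (sum(x * y for x, y in zip(u, v)), i, j)
        for i, u in enumerate(a)
        for j, v in enumerate(b)
    ]
    scored.sort(key=lambda s: s[0], reverse=True)  # highest similarity first
    return [(i, j, sim) for sim, i, j in scored[:k]]

# Toy global descriptors (made-up values) for two image sets.
set_a = [[1.0, 0.0], [0.0, 1.0]]
set_b = [[0.1, 1.0], [1.0, 0.05], [0.7, 0.7]]
for i, j, sim in top_k_pairs(set_a, set_b, k=2):
    print(i, j, round(sim, 3))
```

For large sets, the exhaustive pairwise scoring used here would normally be replaced by an approximate nearest-neighbour index.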

277. 【2603.13912】Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

Link: https://arxiv.org/abs/2603.13912

Authors: Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans develop visual, Humans develop, self-supervised learning process, learning process grounded, perceiving and interacting

Comments: 24 pages, 11 figures. Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.

278. 【2603.13910】Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation

Link: https://arxiv.org/abs/2603.13910

Authors: Stefan Ainetter, Thomas Deixelberger, Edoardo A. Dominici, Philipp Drescher, Konstantinos Vardis, Markus Steinberger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically interpretable indoor, interpretable indoor scenes, produces metrically accurate, framework that produces, produces metrically

Comments: none

Abstract:We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balance coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.

279. 【2603.13907】LineMaster Pro: A Low-Cost Intelligent Line Following Robot with PID Control and Ultrasonic Obstacle Avoidance for Educational Robotics

Link: https://arxiv.org/abs/2603.13907

Authors: Jeni Shahi, Abhishek Shah, A. S. M. Ahsanul Sarkar Akib

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain prohibitively expensive, solutions remain prohibitively, detection capabilities essential, lacking integrated obstacle, Arduino Nano platform

Comments: none

Abstract:Line following robots are fundamental platforms in robotics education, yet commercially available solutions remain prohibitively expensive ($150-300) while lacking integrated obstacle detection capabilities essential for real-world applications. This paper presents LineMaster Pro, an intelligent low-cost line following robot implemented on an Arduino Nano platform that integrates dual TCRT5000 infrared sensors for precision line tracking, an HC-SR04 ultrasonic sensor for real-time obstacle detection, a digitally tuned PID controller with Ziegler-Nichols optimization, and a hierarchical finite state machine for robust obstacle avoidance. A systematic four-phase sensor calibration methodology ensures reliable operation across varying lighting and surface conditions. Experimental validation through 200 controlled trials and 72-hour continuous operation demonstrates mean tracking accuracy of 1.18 cm at 0.4 m/s (95% CI [1.06, 1.30]), obstacle detection reliability of 96.7% within 10-40 cm range with 0.7% false positive rate, and 94% successful recovery from path deviations. The PID implementation achieves 43% improvement over conventional on-off control (p < 0.001). At a total hardware cost of $28.50 based on verified Bangladesh market prices, LineMaster Pro achieves a 94% cost reduction compared to commercial alternatives, establishing a practical benchmark for accessible robotics education in resource-constrained environments.
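
For readers unfamiliar with the PID control mentioned in this abstract, a textbook discrete PID loop can be sketched as below. It is written in Python for readability and is not the robot's firmware; the gains, the error values, the sign convention, and the helper `motor_speeds` are all hypothetical and not taken from the paper.

```python
class PIDController:
    """Textbook discrete PID controller: u = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, error, dt):
        """Return the control output for the current error and time step dt (seconds)."""
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def motor_speeds(base_speed, correction, max_speed=255):
    """Turn a steering correction into clamped (left, right) PWM values."""
    left = max(0, min(max_speed, base_speed + correction))
    right = max(0, min(max_speed, base_speed - correction))
    return left, right


# Hypothetical gains and line-position errors, for illustration only.
pid = PIDController(kp=2.0, ki=0.5, kd=0.1)
first = pid.update(error=1.0, dt=0.1)   # P = 2.0, I = 0.05, D = 0.0
second = pid.update(error=0.5, dt=0.1)  # P = 1.0, I = 0.075, D = -0.5
```

Here the error stands for the line offset estimated from the two infrared sensors, and the correction is added to one wheel and subtracted from the other. Ziegler-Nichols tuning, which the abstract mentions, would be used to choose `kp`, `ki`, and `kd` and is not shown.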

280. 【2603.13904】Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Link: https://arxiv.org/abs/2603.13904

Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: robotic agents operating, streaming video observations, visual state, robotic agents, agents operating

Comments: Preprint

Abstract:For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.

281. 【2603.13901】CT-Conditioned Diffusion Prior with Physics-Constrained Sampling for PET Super-Resolution

Link: https://arxiv.org/abs/2603.13901

Authors: Liutao Yang, Zi Wang, Peiyuan Jing, Xiaowen Wang, Javier A. Montoya-Zegarra, Kuangyu Shi, Daoqiang Zhang, Guang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paired multi-resolution scans, detector geometry, scanner-specific physics, highly under-constrained, multi-resolution scans

Comments: none

Abstract:PET super-resolution is highly under-constrained because paired multi-resolution scans from the same subject are rarely available, and effective resolution is determined by scanner-specific physics (e.g., PSF, detector geometry, and acquisition settings). This limits supervised end-to-end training and makes purely image-domain generative restoration prone to hallucinated structures when anatomical and physical constraints are weak. We formulate PET super-resolution as posterior inference under heterogeneous system configurations and propose a CT-conditioned diffusion framework with physics-constrained sampling. During training, a conditional diffusion prior is learned from high-quality PET/CT pairs using cross-attention for anatomical guidance, without requiring paired LR-HR PET data. During inference, measurement consistency is enforced through a scanner-aware forward model with explicit PSF effects and gradient-based data-consistency refinement. Under both standard and OOD settings, the proposed method consistently improves experimental metrics and lesion-level clinical relevance indicators over strong baselines, while reducing hallucination artifacts and improving structural fidelity.

282. 【2603.13894】Robust Self-Training with Closed-loop Label Correction for Learning from Noisy Labels

Link: https://arxiv.org/abs/2603.13894

Authors: Zhanhui Lin, Yanlin Liu, Sanping Zhou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, significant challenge, Training deep neural, remains a significant, leading to degraded

Comments: none

Abstract:Training deep neural networks with noisy labels remains a significant challenge, often leading to degraded performance. Existing methods for handling label noise typically rely on transition matrices, noise detection, or meta-learning techniques, but they often exhibit low utilization efficiency of noisy samples and incur high computational costs. In this paper, we propose a self-training label correction framework using decoupled bilevel optimization, where a classifier and a neural correction function co-evolve. Leveraging a small clean dataset, our method employs noisy posterior simulation and intermediate features to transfer ground-truth knowledge, forming a closed-loop feedback system that prevents error amplification. Theoretical guarantees underpin the stability of our approach, and extensive experiments on benchmark datasets like CIFAR and Clothing1M confirm state-of-the-art performance with reduced training time, highlighting its practical applicability for learning from noisy labels.

283. 【2603.13893】UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking

Link: https://arxiv.org/abs/2603.13893

Authors: Joan Perez, Giovanni Fusco

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: practical deployment remains, deployment remains hindered, significant architectural heterogeneity, Universal Vision-Language Model, Vision-Language Model Loader

Comments: 22 pages, 3 figures

Abstract:Vision-Language Models (VLMs) have emerged as powerful tools for image understanding tasks, yet their practical deployment remains hindered by significant architectural heterogeneity across model families. This paper introduces UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple VLM architectures on custom image analysis tasks. UVLM currently supports two major model families, LLaVA-NeXT and Qwen2.5-VL, which differ fundamentally in their vision encoding, tokenization, and decoding strategies. The framework abstracts these differences behind a single inference function, enabling researchers to compare models using identical prompts and evaluation protocols. Key features include a multi-task prompt builder with support for four response types (numeric, category, boolean, text), a consensus validation mechanism based on majority voting across repeated inferences, a flexible token budget (up to 1,500 tokens) enabling users to design custom reasoning strategies through prompt engineering, and a built-in chain-of-thought reference mode for benchmarking. UVLM is designed for reproducibility, accessibility, and extensibility and as such is freely deployable on Google Colab using consumer-grade GPU resources. The paper also presents the first benchmarking of different VLMs on tasks of increasing reasoning complexity using a corpus of 120 street-view images.
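
The consensus validation idea in this abstract (majority voting across repeated inferences) can be sketched as below. This is not UVLM's actual interface: the function name `consensus_answer`, the text normalization, and the `min_agreement` threshold are assumptions made for illustration.

```python
from collections import Counter


def consensus_answer(responses, min_agreement=0.5):
    """Majority vote over repeated model responses.

    Returns (answer, agreement), where agreement is the share of runs that
    produced the winning answer. The answer is None when the winner does not
    exceed min_agreement (a hypothetical threshold, not taken from the paper).
    """
    if not responses:
        return None, 0.0
    # Normalize so that "Yes", "yes " and "yes" count as the same vote.
    normalized = [str(r).strip().lower() for r in responses]
    answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    if agreement <= min_agreement:
        return None, agreement
    return answer, agreement


# Five hypothetical runs of the same boolean prompt on the same image.
runs = ["Yes", "yes", "no", "yes ", "No"]
result = consensus_answer(runs)
```

Returning the agreement ratio alongside the answer lets a benchmark report how stable each prediction was, rather than only which answer won.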

284. 【2603.13886】Multi-Modal Character Localization and Extraction for Chinese Text Recognition

Link: https://arxiv.org/abs/2603.13886

Authors: Qilong Li, Chongsheng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Scene text recognition, English text recognition, English text images, recognizing Chinese text, Chinese text images

Comments: Accepted by IEEE Transactions on Multimedia on January 8, 2026; to appear

Abstract:Scene text recognition (STR) methods have demonstrated excellent capability on English text images. However, the complex inner structures of Chinese characters and the large number of character categories make recognizing Chinese text in images challenging. Recent studies have shown that methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply a model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at this https URL.

285. 【2603.13884】SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis

Link: https://arxiv.org/abs/2603.13884

Authors: Ehud Gordon, Meir Yossef Levi, Guy Gilboa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Interpreting the internal, safety-critical domains, internal reasoning, reasoning of vision-language, essential for deploying

Comments: none

Abstract:Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it and propose Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.

286. 【2603.13879】Dual-Strategy Improvement of YOLOv11n for Multi-Scale Object Detection in Remote Sensing Images

Link: https://arxiv.org/abs/2603.13879

Authors: Shuaiyu Zhu, Sergey Ablameyko

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Satellite remote sensing, pose significant challenges, remote sensing images, images pose significant, Satellite remote

Comments: 14 pages, 8 figures

Abstract:Satellite remote sensing images pose significant challenges for object detection due to their high resolution, complex scenes, and large variations in target scales. To address the insufficient detection accuracy of the YOLOv11n model in remote sensing imagery, this paper proposes two improvement strategies. Method 1: (a) a Large Separable Kernel Attention (LSKA) mechanism is introduced into the backbone network to enhance feature extraction for small objects; (b) a Gold-YOLO structure is incorporated into the neck network to achieve multi-scale feature fusion, thereby improving the detection performance of objects at different scales. Method 2: (a) the Gold-YOLO structure is also integrated into the neck network; (b) a MultiSEAMHead detection head is combined to further strengthen the representation and detection capability for small and multi-scale objects. To verify the effectiveness of the proposed improvements, experiments are conducted on the DOTAv1 dataset. The results show that, while maintaining the lightweight advantage of the model, the proposed methods improve detection accuracy (mAP@0.5) by 1.3% and 1.8%, respectively, compared with the baseline YOLOv11n, demonstrating the effectiveness and practical value of the proposed approaches for object detection in remote sensing images.

287. 【2603.13878】Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Link: https://arxiv.org/abs/2603.13878

Authors: Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan, Yaqian Li, Kaiwen Long, Xun Gong, Masayuki Ikebe, Yefeng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: visual question answering, advanced medical visual, medical visual question, existing CoT rationales, reasoning process clinicians

Comments: Accepted by CVPR 2026 Findings Track

Abstract:Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: this http URL. Dataset Card: this http URL

288. 【2603.13874】Zero-Forgetting CISS via Dual-Phase Cognitive Cascades

Link: https://arxiv.org/abs/2603.13874

Authors: Yuquan Lu, Yifu Guo, Zishan Xu, Siyu Zhang, Yu Huo, Siyue Chen, Siyan Wu, Chenghua Zhu, Ruixuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: catastrophic forgetting, catastrophic forgetting challenge, catastrophic forgetting originates, Strict Parameter Isolation, Continual semantic segmentation

Comments: none

Abstract:Continual semantic segmentation (CSS) is a cornerstone task in computer vision that enables a large number of downstream applications, but faces the catastrophic forgetting challenge. In conventional class-incremental semantic segmentation (CISS) frameworks using Softmax-based classification heads, catastrophic forgetting originates from Catastrophic forgetting and task affiliation probability. We formulate these problems and provide a theoretical analysis to more deeply understand the limitations in existing CISS methods, particularly Strict Parameter Isolation (SPI). To address these challenges, we follow a dual-phase intuition from human annotators, and introduce Cognitive Cascade Segmentation (CogCaS), a novel dual-phase cascade formulation for CSS tasks in the CISS setting. By decoupling the task into class-existence detection and class-specific segmentation, CogCaS enables more effective continual learning, preserving previously learned knowledge while incorporating new classes. On two benchmark datasets, PASCAL VOC 2012 and ADE20K, we show significant improvements in a variety of challenging scenarios, particularly those with long sequences of incremental tasks, when compared to existing state-of-the-art methods. Our code will be made publicly available upon paper acceptance.

289. 【2603.13864】Inevitable Encounters: Backdoor Attacks Involving Lossy Compression

Link: https://arxiv.org/abs/2603.13864

Authors: Qian Li, Yunuo Chen, Yuntian Chen

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep learning systems, compromise deep learning, Real-world backdoor attacks, Real-world backdoor, learning systems

Comments: none

Abstract:Real-world backdoor attacks often require poisoned datasets to be stored and transmitted before being used to compromise deep learning systems. However, in the era of big data, the inevitable use of lossy compression poses a fundamental challenge to invisible backdoor attacks. We find that triggers embedded in RGB images often become ineffective after the images are lossily compressed into binary bitstreams (e.g., JPEG files) for storage and transmission. As a result, the poisoned data lose their malicious effect after compression, causing backdoor injection to fail. In this paper, we highlight the necessity of explicitly accounting for the lossy compression process in backdoor attacks. This requires attackers to ensure that the transmitted binary bitstreams preserve malicious trigger information, so that effective triggers can be recovered in the decompressed data. Building on the region-of-interest (ROI) coding mechanism in image compression, we propose two poisoning strategies tailored to inevitable lossy compression. First, we introduce Universal Attack Activation, a universal method that uses sample-specific ROI masks to reactivate trigger information in binary bitstreams for learned image compression (LIC). Second, we present Compression-Adapted Attack, a new attack strategy that employs customized ROI masks to encode trigger information into binary bitstreams and is applicable to both traditional codecs and LIC. Extensive experiments demonstrate the effectiveness of both strategies.

290. 【2603.13859】Geo-ID: Test-Time Geometric Consensus for Cross-View Consistent Intrinsics

Link: https://arxiv.org/abs/2603.13859

Authors: Alara Dirik, Stefanos Zafeiriou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: estimate physically based, physically based rendering, physically based, PBR, image decomposition aims

Comments: none

Abstract:Intrinsic image decomposition aims to estimate physically based rendering (PBR) parameters such as albedo, roughness, and metallicity from images. While recent methods achieve strong single-view predictions, applying them independently to multiple views of the same scene often yields inconsistent estimates, limiting their use in downstream applications such as editable neural scenes and 3D reconstruction. Video-based models can improve cross-frame consistency but require dense, ordered sequences and substantial compute, limiting their applicability to sparse, unordered image collections. We propose Geo-ID, a novel test-time framework that repurposes pretrained single-view intrinsic predictors to produce cross-view consistent decompositions by coupling independent per-view predictions through sparse geometric correspondences that form uncertainty-aware consensus targets. Geo-ID is model-agnostic, requires no retraining or inverse rendering, and applies directly to off-the-shelf intrinsic predictors. Experiments on synthetic benchmarks and real-world scenes demonstrate substantial improvements in cross-view intrinsic consistency as the number of views increases, while maintaining comparable single-view decomposition performance. We further show that the resulting consistent intrinsics enable coherent appearance editing and relighting in downstream neural scene representations.

291. 【2603.13858】Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

Link: https://arxiv.org/abs/2603.13858

Authors: Bohan Zhang, Weidong Tang, Zhixiang Chi, Yi Jin, Zhenbo Li, Yang Wang, Yanan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simultaneously discovering emerging, Category Discovery, recognize known classes, aims to recognize, simultaneously discovering

Comments: Accepted to CVPR 2026 Findings. Code available at [this https URL](https://github.com/brandinzhang/LTC)

Abstract:On-the-Fly Category Discovery (OCD) aims to recognize known classes while simultaneously discovering emerging novel categories during inference, using supervision only from known classes during offline training. Existing approaches rely either on fixed label supervision or on diffusion-based augmentations to enhance the backbone, yet none of them explicitly train the model to perform the discovery task required at test time. It is fundamentally unreasonable to expect a model optimized on limited labeled data to carry out a qualitatively different discovery objective during inference. This mismatch creates a clear optimization misalignment between the offline learning stage and the online discovery stage. In addition, prior methods often depend on hash-based encodings or severe feature compression, which further limits representational capacity. To address these issues, we propose Learning through Creation (LTC), a fully feature-based and hash-free framework that injects novel-category awareness directly into offline learning. At its core is a lightweight, online pseudo-unknown generator driven by kernel-energy minimization and entropy maximization (MKEE). Unlike previous methods that generate synthetic samples once before training, our generator evolves jointly with the model dynamics and synthesizes pseudo-novel instances on the fly at negligible cost. These samples are incorporated through a dual max-margin objective with adaptive thresholding, strengthening the model's ability to delineate and detect unknown regions through explicit creation. Extensive experiments across seven benchmarks show that LTC consistently outperforms prior work, achieving improvements ranging from 1.5 percent to 13.1 percent in all-class accuracy. The code is available at this https URL

292. 【2603.13856】OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

Link: https://arxiv.org/abs/2603.13856

Authors: Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Building AI systems, Building, physical world requires, pattern recognition, world requires

Comments: none

Abstract:Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.

293. 【2603.13855】VFM-Loc: Zero-Shot Cross-View Geo-Localization via Aligning Discriminative Visual Hierarchies

Link: https://arxiv.org/abs/2603.13855

Authors: Jun Lu, Zehao Sang, Haoqi Wei, Xiangyun Liu, Kun Zhu, Haitao Guo, Zhihui Gong, Lei Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geo-tagged satellite images, remote sensing aims, satellite images, remote sensing, sensing aims

Comments: none

Abstract:Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closed-set benchmarks, they often fail to generalize to unconstrained, real-world scenarios due to severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations from vision foundation models (VFMs). VFM-Loc identifies and matches discriminative visual clues across different viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism using Generalized Mean pooling and Scale-Weighted RMAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, linearly aligning heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc exhibits strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset with large oblique angles. This work highlights that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust and training-free paradigm for real-world CVGL. The relevant code is made available at: this https URL.
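
Of the components named in this abstract, Generalized Mean (GeM) pooling is the simplest to illustrate. The sketch below shows only the standard GeM formula on a toy feature map; the exponent values and the data layout are assumptions, and the paper's Scale-Weighted RMAC, domain-wise PCA, and Orthogonal Procrustes steps are not shown.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling over spatial positions.

    feature_map: list of C channels, each a flat list of activations.
    Returns one value per channel: (mean(max(x, eps) ** p)) ** (1 / p).
    p = 1 gives average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to eps so the fractional power is defined for non-positive inputs.
        clamped = [max(x, eps) for x in channel]
        mean_power = sum(x ** p for x in clamped) / len(clamped)
        pooled.append(mean_power ** (1.0 / p))
    return pooled


# Toy 2-channel feature map with 4 spatial positions each (hypothetical values).
features = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 2.0]]
average_like = gem_pool(features, p=1.0)
peak_like = gem_pool(features, p=10.0)
```

Raising `p` shifts the pooled value of the second channel from its mean toward its single strong activation, which is why GeM is often used to keep distinctive local responses in a global descriptor.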

294. 【2603.13843】MOGeo: Beyond One-to-One Cross-View Object Geo-localization

Link: https://arxiv.org/abs/2603.13843

Authors: Bo Lv, Qingwang Zhang, Le Wu, Yuanyuan Li, Yingying Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cross-View Object Geo-Localization, aims to locate, Cross-View Multi-Object Geo-Localization, query image, multi-object geo-localization

Comments:

Abstract: Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements of real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view object geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.

295. 【2603.13831】Efficient Semi-Automated Material Microstructure Analysis Using Deep Learning: A Case Study in Additive Manufacturing

Link: https://arxiv.org/abs/2603.13831

Authors: Sanjeev S. Navaratna, Nikhil Thawari, Gunashekhar Mari, Amritha V P, Murugaiyan Amirthalingam, Rohit Batra

Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Keywords: remains challenging due, structure-property correlation, testing conditions, identification and structure-property, remains challenging

Comments:

Abstract: Image segmentation is fundamental to microstructural analysis for defect identification and structure-property correlation, yet remains challenging due to pronounced heterogeneity in materials images arising from varied processing and testing conditions. Conventional image processing techniques often fail to capture such complex features, rendering them ineffective for large-scale analysis. Even deep learning approaches struggle to generalize across heterogeneous datasets due to the scarcity of high-quality labeled data. Consequently, segmentation workflows often rely on manual, expert-driven annotations, which are labor-intensive and difficult to scale. Using an additive manufacturing (AM) dataset as a case study, we present a semi-automated, active-learning-based segmentation pipeline that integrates a U-Net-based convolutional neural network with an interactive user annotation and correction interface and a representative core-set image selection strategy. The active learning workflow iteratively updates the model by incorporating user-corrected segmentations into the training pool, while the core-set strategy identifies representative images for annotation. Three subset selection strategies (manual selection, uncertainty-driven sampling, and the proposed maximin Latin hypercube sampling from embeddings (SMILE) method) were evaluated over six refinement rounds. The SMILE strategy consistently outperformed the other approaches, improving the macro F1 score from 0.74 to 0.93 while reducing manual annotation time by about 65 percent. The segmented defect regions were further analyzed using a coupled classification model to categorize defects based on microstructural characteristics and map them to corresponding AM process parameters. The proposed framework reduces labeling effort while maintaining scalability and robustness and is broadly applicable to image-based analysis across diverse materials systems.

296. 【2603.13818】PA-Net: Precipitation-Adaptive Mixture-of-Experts for Long-Tail Rainfall Nowcasting

Link: https://arxiv.org/abs/2603.13818

Authors: Xinyu Xiao, Sen Lei, Eryun Liu, Shiming Xiang, Hao Li, Cheng Yuan, Yuan Qi, Qizhao Jin

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multi-variate atmospheric fields, greatest societal impact, extreme long-tailed rainfall, long-tailed rainfall distribution, modeling million-scale spatiotemporal

Comments:

Abstract: Precipitation nowcasting is vital for flood warning, agricultural management, and emergency response, yet two bottlenecks persist: the prohibitive cost of modeling million-scale spatiotemporal tokens from multi-variate atmospheric fields, and the extreme long-tailed rainfall distribution where heavy-to-torrential events -- those of greatest societal impact -- constitute fewer than 0.1% of all samples. We propose the Precipitation-Adaptive Network (PA-Net), a Transformer framework whose computational budget is explicitly governed by rainfall intensity. Its core component, Precipitation-Adaptive MoE (PA-MoE), dynamically scales the number of activated experts per token according to local precipitation magnitude, channeling richer representational capacity toward the rare yet critical heavy-rainfall tail. A Dual-Axis Compressed Latent Attention mechanism factorizes spatiotemporal attention with convolutional reduction to manage massive context lengths, while an intensity-aware training protocol progressively amplifies learning signals from extreme-rainfall samples. Experiments on ERA5 demonstrate consistent improvements over state-of-the-art baselines, with particularly significant gains in heavy-rain and rainstorm regimes.
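
To make the idea of scaling the number of activated experts with rainfall intensity concrete, the following is a minimal sketch of intensity-dependent top-k routing. It is not the PA-MoE implementation: the function names, the rain-rate thresholds (in mm/h), and the rule "one extra expert per threshold crossed" are all assumptions chosen for illustration.

```python
import bisect


def num_active_experts(rain_mm_h, thresholds=(0.1, 2.5, 8.0, 16.0), k_min=1):
    """Map a local rain rate to an expert count.

    The thresholds are hypothetical; each threshold that the rain rate
    reaches adds one expert on top of k_min.
    """
    return k_min + bisect.bisect_right(thresholds, rain_mm_h)


def route(gate_scores, rain_mm_h):
    """Return the indices of the experts activated for one token.

    gate_scores: one router score per expert (higher is better).
    """
    k = min(num_active_experts(rain_mm_h), len(gate_scores))
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])


scores = [0.1, 0.5, 0.2, 0.9, 0.3, 0.05]
dry_token = route(scores, rain_mm_h=0.0)     # a single expert
storm_token = route(scores, rain_mm_h=20.0)  # five experts
```

Under this scheme most tokens (no rain) stay cheap, while the rare heavy-rain tokens receive more compute, which is the budget behavior the abstract describes.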

297. 【2603.13803】ALTIS: Automated Loss Triage and Impact Scoring from Sentinel-1 SAR for Property-Level Flood Damage Assessment

Link: https://arxiv.org/abs/2603.13803

Authors: Amogh Vinaykumar, Prem Kamasani

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural catastrophes globally, costliest natural catastrophes, industry post-event response, post-event response remains, response remains heavily

Comments: 27 pages, 9 figures. Preliminary results; full end-to-end validation ongoing

Abstract:Floods are among the costliest natural catastrophes globally, yet the property and casualty insurance industry's post-event response remains heavily reliant on manual field inspection: slow, expensive, and geographically constrained. Satellite Synthetic Aperture Radar (SAR) offers cloud-penetrating, all-weather imaging uniquely suited to rapid post-flood assessment, but existing research evaluates SAR flood detection against academic benchmarks such as IoU and F1-score that do not capture insurance-workflow requirements. We present ALTIS: a five-stage pipeline transforming raw Sentinel-1 GRD and SLC imagery into property-level impact scores within 24-48 hours of flood peak. Unlike prior approaches producing pixel-level maps or binary outputs, ALTIS delivers a ranked, confidence-scored triage list consumable by claims platforms, integrating (i) multi-temporal SAR change detection using dual-polarization VV/VH intensity and InSAR coherence, (ii) physics-informed depth estimation fusing flood extent with high-resolution DEMs, (iii) property-level zonal statistics from parcel footprints, (iv) depth-damage calibration against NFIP claims, and (v) confidence-scored triage ranking. We formally define Insurance-Grade Flood Triage (IGFT) and introduce the Inspection Reduction Rate (IRR) and Triage Efficiency Score (TES). Using Hurricane Harvey (2017) across Harris County, Texas, we present preliminary analysis grounded in validated sub-components suggesting ALTIS is designed to achieve an IRR of approximately 0.52 at 90% recall of high-severity claims, potentially eliminating over half of unnecessary dispatches. By blending SAR flood intelligence with the realities of claims management, ALTIS establishes a methodological baseline for translating earth observation research into measurable insurance outcomes.

298. 【2603.13800】Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space

Link: https://arxiv.org/abs/2603.13800

Authors: Quoc-Huy Trinh, Xi Ding, Yang Liu, Zhenyue Qin, Xingjian Li, Gorkem Durak, Halil Ertugrul Aktas, Elif Keles, Ulas Bagci, Min Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, remains largely unexplored

Comments:

Abstract:Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.

299. 【2603.13787】Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective

Link: https://arxiv.org/abs/2603.13787

Authors: Junjie Zhou, Bao Xue, Meiling Wang, Wei Shao, Daoqiang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrating genomic data, cancer prognosis, recent research, enhance the precision, precision of cancer

Comments:

Abstract:To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.

300. 【2603.13783】RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

Link: https://arxiv.org/abs/2603.13783

Authors: Xuezhen Wang, Li Ma, Yulin Shen, Zeyu Wang, Pedro V. Sander

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: render dynamic scenes, slow-motion playback, ability to reconstruct, reconstruct and render, render dynamic

Comments: Accepted to CVPR2026

Abstract:Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow-guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.

301. 【2603.13782】Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection

Link: https://arxiv.org/abs/2603.13782

Authors: Jaehwan Jeong, Evelyn Zhu, Jinying Lin, Emmanuel Jaimes, Tuan-Anh Vu, Jungseock Joo, Sangpil Kim, M. Khalid Jawed

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong potential, predicting semantic actions, demonstrating the ability, demonstrated strong, strong potential

Comments: Keywords: Vision-Language Action (VLA), Reinforcement Learning (RL), Navigation Path Recovery, Robot Operating System (ROS)

Abstract: Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, showing the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6% deviation detection rate with a low false-positive rate of 11.7%. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.

302. 【2603.13779】AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Link: https://arxiv.org/abs/2603.13779

Authors: Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, achieved impressive success, Large Language Models, Large Language, natural visual understanding

Comments:

Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.

303. 【2603.13771】Brain Tumor Classification from 3D MRI Using Persistent Homology and Betti Features: A Topological Data Analysis Approach on BraTS2020

Link: https://arxiv.org/abs/2603.13771

Authors: Faisal Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: magnetic resonance imaging, challenging problem due, medical imaging remains, MRI volumes, Fluid Attenuated Inversion

Comments: 21 pages, 7 figures

Abstract:Accurate and interpretable brain tumor classification from medical imaging remains a challenging problem due to the high dimensionality and complex structural patterns present in magnetic resonance imaging (MRI). In this study, we propose a topology-driven framework for brain tumor classification based on Topological Data Analysis (TDA) applied directly to three-dimensional (3D) MRI volumes. Specifically, we analyze 3D Fluid Attenuated Inversion Recovery (FLAIR) images from the BraTS 2020 dataset and extract interpretable topological descriptors using persistent homology. Persistent homology captures intrinsic geometric and structural characteristics of the data through Betti numbers, which describe connected components (Betti-0), loops (Betti-1), and voids (Betti-2). From the 3D MRI volumes, we derive a compact set of 100 topological features that summarize the underlying topology of brain tumor structures. These descriptors represent complex 3D tumor morphology while significantly reducing data dimensionality. Unlike many deep learning approaches that require large-scale training data or complex architectures, the proposed framework relies on computationally efficient topological features extracted directly from the images. These features are used to train classical machine learning classifiers, including Random Forest and XGBoost, for binary classification of high-grade glioma (HGG) and low-grade glioma (LGG). Experimental results on the BraTS 2020 dataset show that the Random Forest classifier combined with selected Betti features achieves an accuracy of 89.19%. These findings highlight the potential of persistent homology as an effective and interpretable approach for analyzing complex 3D medical images and performing brain tumor classification.
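
As a small illustration of the Betti-0 descriptor mentioned above (the number of connected components), the sketch below counts components of a thresholded 3D volume at several intensity levels. It is a simplified stand-in, not the paper's pipeline: it covers only Betti-0, uses 6-connectivity on voxels at or above each threshold, and the function names and the toy volume are assumptions made for this example. Betti-1 and Betti-2 would normally require a persistent-homology library.

```python
from collections import deque

# Face-adjacent neighbor offsets (6-connectivity) in a 3D grid.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def betti0(volume, threshold):
    """Count 6-connected components of voxels with intensity >= threshold.

    volume: nested lists indexed as volume[z][y][x].
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    components = 0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if volume[z][y][x] < threshold or (z, y, x) in seen:
                    continue
                # A new, unvisited foreground voxel starts a new component.
                components += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        n = (cz + dz, cy + dy, cx + dx)
                        if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and n not in seen
                                and volume[n[0]][n[1]][n[2]] >= threshold):
                            seen.add(n)
                            queue.append(n)
    return components


def betti0_curve(volume, thresholds):
    """Betti-0 at each threshold, usable as a small feature vector."""
    return [betti0(volume, t) for t in thresholds]


toy_volume = [[[9, 1, 5, 1, 9]]]  # a 1 x 1 x 5 volume
curve = betti0_curve(toy_volume, thresholds=[10, 8, 4, 1])
```

Sweeping the threshold and recording the count at each level yields a fixed-length vector that a classical classifier such as a random forest can consume.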

304. 【2603.13770】PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment

Link: https://arxiv.org/abs/2603.13770

Authors: Zhexiao Xiong, Yizhi Song, Liu He, Wei Xiong, Yu Yuan, Feng Qiao, Nathan Jacobs

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simulating dynamic scenes, Video Diffusion Models, Diffusion Models, offer a promising, scenes and environments

Comments:

Abstract:Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly-curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at this https URL.

305. 【2603.13759】QTrack: Query-Driven Reasoning for Multi-modal MOT

Link: https://arxiv.org/abs/2603.13759

Authors: Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak, Moloud Abdar, Janibul Bashir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-object tracking, semantic instructions, traditionally focused, focused on estimating, estimating trajectories

Comments: Project Page: [this https URL](https://gaashlab.github.io/QTrack/)

Abstract:Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at this https URL

306. 【2603.13756】Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation

Link: https://arxiv.org/abs/2603.13756

Authors: Yujiro Onishi, Ryo Takizawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated impressive results, Deformable Object Manipulation, Imitation learning, OOD, Deformable Object

Comments:

Abstract:Imitation learning has demonstrated impressive results in robotic manipulation but fails under out-of-distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near-infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often impractical for DOM owing to its inherent complexities, including self-occlusion. To address the OOD problem in DOM, we propose a novel framework, Exploration-assisted Bottleneck Transition for Deformable Object Manipulation (ExBot), which addresses the OOD challenge through two key advantages. First, we introduce bottleneck states, standardized configurations that serve as starting points for task execution. This enables the reconceptualization of OOD challenges as the problem of transitioning diverse initial states to these bottleneck states, significantly reducing demonstration requirements. Second, to account for imperfect perception, we partition the OOD state space based on recognizability and employ dual action primitives. This approach enables ExBot to manipulate even unrecognizable states without requiring accurate perception. By concentrating demonstrations around bottleneck states and leveraging exploration to alter perceptual conditions, ExBot achieves both data efficiency and robustness to severe OOD scenarios. Real-world experiments on rope and cloth manipulation demonstrate successful task completion from diverse OOD states, including severe self-occlusions.

307. 【2603.13745】Multi-Object Advertisement Creative Generation

Link: https://arxiv.org/abs/2603.13745

Authors: Jialu Gao, Mithun Das Gupta, Qun Li, Raveena Kshatriya, Andrew D. Wilson, Keng-hao Chang, Balasaravanan Thoravi Kumaravel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: create lifestyle images, Lifestyle images, Generative Artificial Intelligence, everyday settings, create lifestyle

Comments:

Abstract: Lifestyle images are photographs that capture environments and objects in everyday settings. In furniture product marketing, advertisers often create lifestyle images containing products to resonate with potential buyers, allowing buyers to visualize how the products fit into their daily lives. While recent advances in Generative Artificial Intelligence (GenAI) have given rise to realistic image content creation, their application in e-commerce advertising is challenging because high-quality ads must authentically represent the products in realistic scenarios. Therefore, manual intervention is usually required for individual generations, making it difficult to scale to larger product catalogs. To understand the challenges faced by advertisers using GenAI to create lifestyle images at scale, we conducted evaluations on ad images generated using state-of-the-art image generation models and identified the major challenges. Based on our findings, we present CreativeAds, a multi-product ad creation system that supports scalable automated generation with customized parameter adjustment for individual generation. To ensure automated high-quality ad generation, CreativeAds introduces a pipeline that consists of three modules to address challenges in product pairing, layout generation, and background generation separately. Furthermore, CreativeAds contains an intuitive user interface to allow users to oversee generation at scale, and it also supports detailed controls on individual generation for user-customized adjustments. We performed a user study on CreativeAds and extensive evaluations of the generated images, demonstrating CreativeAds's ability to create a large number of high-quality images at scale for advertisers without requiring expertise in GenAI tools.

308. 【2603.13741】Ego-1K -- A Large-Scale Multiview Video Dataset for Egocentric Vision

Link: https://arxiv.org/abs/2603.13741

Authors: Jae Yong Lee, Daniel Scharstein, Akash Bapat, Hao Hu, Andrew Fu, Haoru Zhao, Paul Sammut, Xiang Li, Stephen Jeapes, Anik Gupta, Lior David, Saketh Madhuvarasu, Jay Girish Joshi, Jason Wither

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiview videos designed, time-synchronized egocentric multiview, egocentric multiview videos, advance neural, large-scale collection

Comments: To appear in CVPR 2026

Abstract:We present Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos designed to advance neural 3D video synthesis and dynamic scene understanding. The dataset contains nearly 1,000 short egocentric videos captured with a custom rig with 12 synchronized cameras surrounding a 4-camera VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods, an important research area as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to large disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain. It is available at this https URL.

309. 【2603.13740】Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

Link: https://arxiv.org/abs/2603.13740

Authors: Zengyan Wang, Sirshapan Mitra, Rajat Modi, Grace Lim, Yogesh Rawat

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: three-view dataset designed, correspondence learning, three-view dataset, dataset designed, dataset combines structured

Comments:

Abstract:We introduce Sky2Ground, a three-view dataset designed for varying altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state of the art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges under large altitude variations. We also examine reconstruction methods, highlighting the challenges introduced by sparse geometric overlap, varying perspectives, and the use of real imagery, which often introduces noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model which enhances cross-view consistency when incorporating satellite imagery with a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in terms of absolute performance. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.
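
The abstract reports gains on RRA@5 without defining the metric. Assuming the definition commonly used in pose-estimation benchmarks (the fraction of image pairs whose relative rotation has an angular error below 5 degrees), a minimal sketch of the computation looks like this. The helper names and the toy rotations are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def rotation_z(deg):
    """3x3 rotation about the z axis, used here only to build toy data."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


def rotation_error_deg(r_gt, r_pred):
    """Geodesic angle between two rotation matrices, in degrees.

    Uses trace(R_gt^T R_pred) = sum of elementwise products, then
    angle = arccos((trace - 1) / 2).
    """
    trace = sum(r_gt[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def rra(pairs, threshold_deg=5.0):
    """Fraction of (ground truth, prediction) pairs under the threshold."""
    hits = sum(1 for gt, pred in pairs
               if rotation_error_deg(gt, pred) < threshold_deg)
    return hits / len(pairs)


pairs = [
    (rotation_z(30), rotation_z(32)),  # 2 degree error: counted
    (rotation_z(30), rotation_z(40)),  # 10 degree error: not counted
    (rotation_z(0), rotation_z(4)),    # 4 degree error: counted
]
score = rra(pairs, threshold_deg=5.0)
```

RTA@5 is typically computed the same way, with the angle between translation directions in place of the rotation angle.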

310. 【2603.13739】UniVid: Pyramid Diffusion Model for High Quality Video Generation

Link: https://arxiv.org/abs/2603.13739

Authors: Xinyu Xiao, Binbin Yang, Tingtian Li, Yipeng Yu, Sen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: prominent research focus, research focus, prominent research, Diffusion-based, generation

Comments:

Abstract: Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, integrating the two generative paradigms into a unified model remains a challenge. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model to generate temporally coherent frames by introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism, whose attention scores can be freely re-weighted to interpolate between single-modality and two-modality control during inference. Extensive experiments showcase that our UniVid achieves superior temporal coherence on T2V, I2V and (T+I)2V tasks.

311. 【2603.13728】Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search

Link: https://arxiv.org/abs/2603.13728

Authors: Bo Ma, Jinsong Wu, Wei Qi Yan

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: declared privacy budget, Learning systems, language models, Expectation-Maximization Privacy Assessment

Comments

Click to view abstract

Abstract:Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to model how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision-language models (VLMs). We propose Bodhi VLM, a privacy-alignment modeling framework for hierarchical neural representations: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable budget-alignment signal by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/\epsilon$). The output is reference-relative and is not a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the visual encoders of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean ± std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: Bodhi-VLM GitHub repository (this https URL)

312. 【2603.13719】Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

Link: https://arxiv.org/abs/2603.13719

Authors: Yabin Zhu, Jianqi Li, Chenglong Li, Jiaxiang Wang, Chengjie Gu, Jin Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high resource consumption, Parameter-efficient fine-tuning, parameter storage burden, including time inefficiency, full-model fine-tuning

Comments

Click to view abstract

Abstract:Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.

313. 【2603.13709】REAEDP: Entropy-Calibrated Differentially Private Data Release with Formal Guarantees and Attack-Based Evaluation

Link: https://arxiv.org/abs/2603.13709

Authors: Bo Ma, Jinsong Wu, Wei Qi Yan

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sensitive data release, Sensitive data, output-side privacy threats, attribute inference, membership inference

Comments

Click to view abstract

Abstract:Sensitive data release is vulnerable to output-side privacy threats such as membership inference, attribute inference, and record linkage. This creates a practical need for release mechanisms that provide formal privacy guarantees while preserving utility in measurable ways. We propose REAEDP, a differential privacy framework that combines entropy-calibrated histogram release, a synthetic-data release mechanism, and attack-based evaluation. On the theory side, we derive an explicit sensitivity bound for Shannon entropy, together with an extension to Rényi entropy, for adjacent histogram datasets, enabling calibrated differentially private release of histogram statistics. We further study a synthetic-data mechanism $\mathcal{F}$ with a privacy-test structure and show that it satisfies a formal differential privacy guarantee under the stated parameter conditions. On multiple public tabular datasets, the empirical entropy change remains below the theoretical bound in the tested regime, standard Laplace and Gaussian baselines exhibit comparable trends, and both membership-inference and linkage-style attack performance move toward random-guess behavior as the privacy parameter decreases. These results support REAEDP as a practically usable privacy-preserving release pipeline in the tested settings. Source code: this https URL
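As a rough illustration of the ingredients named in the REAEDP abstract, the sketch below computes the Shannon entropy of a histogram and releases the counts with the standard Laplace mechanism (one of the baselines the abstract mentions). It is not the paper's entropy-calibrated mechanism and does not reproduce its sensitivity bound; the function names, the L1 sensitivity of 1 (add-or-remove-one adjacency), and the clipping of negative counts are assumptions made for this example.

```python
import math
import random


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_histogram(counts, epsilon, rng, sensitivity=1.0):
    """Standard Laplace mechanism on histogram counts.

    Assumption: L1 sensitivity 1 (one record added or removed).
    Clipping at zero is post-processing and does not weaken the guarantee.
    """
    scale = sensitivity / epsilon
    return [max(0.0, c + laplace_noise(scale, rng)) for c in counts]


rng = random.Random(0)
true_counts = [40, 30, 20, 10]  # hypothetical histogram
noisy_counts = release_histogram(true_counts, epsilon=0.5, rng=rng)
entropy_shift = abs(shannon_entropy(noisy_counts) - shannon_entropy(true_counts))
print(f"empirical entropy change: {entropy_shift:.4f} bits")
```

Comparing `entropy_shift` against a theoretical bound over many random draws is the kind of check the abstract describes; the bound itself would have to come from the paper.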

314. 【2603.13708】RSEdit: Text-Guided Image Editing for Remote Sensing

Link: https://arxiv.org/abs/2603.13708

Authors: Chen Zhenyuan, Zhang Zechuan, Zhang Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: General-domain text-guided image, General-domain text-guided, achieve strong photorealism, hallucinate objects, introduce artifacts

Comments

Click to view abstract

Abstract:General-domain text-guided image editors achieve strong photorealism but introduce artifacts, hallucinate objects, and break the orthographic constraints of remote sensing (RS) imagery. We trace this gap to two high-level causes: (i) limited RS world knowledge in pre-trained models, and (ii) conditioning schemes that misalign with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models - both U-Net and DiT - into instruction-following RS editors via channel concatenation and in-context token concatenation. Trained on over 60,000 semantically rich bi-temporal remote sensing image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general and commercial baselines, demonstrating strong generalizability across diverse scenarios including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: this https URL

315. 【2603.13695】Steering Generative Models for Accessibility: EasyRead Image Generation

Link: https://arxiv.org/abs/2603.13695

Authors: Nicolas Dickenmann, Yanis Merzouki, Sonia Laguna, Thy Nowak-Tran, Emanuele Palumbo, Julia E. Vogt, Gerda Binder

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: represent specific concepts, low literacy, intellectual disabilities, language barriers, represent specific

Comments

Click to view abstract

Abstract:EasyRead pictograms are simple, visually clear images that represent specific concepts and support comprehension for people with intellectual disabilities, low literacy, or language barriers. The large-scale production of EasyRead content has traditionally been constrained by the cost and expertise required to manually design pictograms. In contrast, automatic generation of such images could significantly reduce production time and cost, enabling broader accessibility across digital and printed materials. However, modern diffusion-based image generation models tend to produce outputs that exhibit excessive visual detail and lack stylistic stability across random seeds, limiting their suitability for clear and consistent pictogram generation. This challenge highlights the need for methods specifically tailored to accessibility-oriented visual content. In this work, we present a unified pipeline for generating EasyRead pictograms by fine-tuning a Stable Diffusion model using LoRA adapters on a curated corpus that combines augmented samples from multiple pictogram datasets. Since EasyRead pictograms lack a unified formal definition, we introduce an EasyRead score to benchmark pictogram quality and consistency. Our results demonstrate that diffusion models can be effectively steered toward producing coherent EasyRead-style images, indicating that generative models can serve as practical tools for scalable and accessible pictogram production.

316. 【2603.13682】Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning

Link: https://arxiv.org/abs/2603.13682

Authors: Sungrae Hong, Jiwon Jeong, Jisu Shin, Donghee Han, Sol Lee, Kyungeun Kim, Mun Yong Yi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering effective learning, Slide Image, Multiple Instance Learning, effective learning, offering effective

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass settings, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, a semantic feature remix applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel's Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.
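To make the idea of a severity-weighted cross-entropy concrete, here is one possible formulation, offered only as a sketch: the usual cross-entropy term is scaled by one plus the expected severity of the predicted distribution under an asymmetric severity matrix. The exact loss in the paper may differ; the matrix values, the `1 + expected severity` scaling, and all function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def severity_weighted_ce(logits, target, severity):
    """Cross-entropy scaled by the expected severity of the predicted mistakes.

    severity[t][j] is the (hypothetical) cost of predicting class j when the
    true class is t; the matrix need not be symmetric.
    """
    probs = softmax(logits)
    expected_severity = sum(severity[target][j] * p for j, p in enumerate(probs))
    return (1.0 + expected_severity) * -math.log(probs[target])


# Hypothetical 3-class severity matrix: missing class 2 as class 0 costs the most.
SEVERITY = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [4.0, 2.0, 0.0],
]

# The true class is 2 in both cases; only the wrongly favoured class differs.
loss_severe = severity_weighted_ce([3.0, 0.0, 0.0], target=2, severity=SEVERITY)
loss_mild = severity_weighted_ce([0.0, 3.0, 0.0], target=2, severity=SEVERITY)
print(f"severe mistake: {loss_severe:.3f}, mild mistake: {loss_mild:.3f}")
```

With an all-zero severity matrix the function reduces to plain cross-entropy, which makes it easy to compare against an unweighted baseline.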

317. 【2603.13679】Toward Scalable Co-located Practical Learning: Assisting with Computer Vision and Multimodal Analytics

Link: https://arxiv.org/abs/2603.13679

Authors: Xinyu Li, Linxuan Zhao, Roberto Martinez-Maldonado, Dragan Gasevic, Lixiang Yan

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Keywords: single ceiling-mounted camera, capture fine-grained learning, single ceiling-mounted, ceiling-mounted camera, capture fine-grained

Comments

Click to view abstract

Abstract:This study examined whether a single ceiling-mounted camera could be used to capture fine-grained learning behaviours in co-located practical learning. In undergraduate nursing simulations, teachers first identified seven observable behaviour categories, which were then used to train a YOLO-based detector. Video data were collected from 52 sessions, and analyses focused on Scenario A because it produced greater behavioural variation than Scenario B. Annotation reliability was high (F1=0.933). On the held-out test set, the model achieved a precision of 0.789, a recall of 0.784, and an mAP@0.5 of 0.827. When only behaviour frequencies were compared, no robust differences were found between high- and low-performing groups. However, when behaviour labels were analysed together with spatial context, clear differences emerged in both task and collaboration performance. Higher-performing teams showed more patient interaction in the primary work area, whereas lower-performing teams showed more phone-related activity and more activity in secondary areas. These findings suggest that behavioural data are more informative when interpreted together with where they occur. Overall, the study shows that a single-camera computer vision approach can support the analysis of teamwork and task engagement in face-to-face practical learning without relying on wearable sensors.

318. 【2603.13669】SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

Link: https://arxiv.org/abs/2603.13669

Authors: Mahdi Naseri, Zhou Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Image Quality Assessment, aims to estimate, Quality Assessment, estimate perceptual quality, No-Reference Image Quality

Comments: Submitted to IEEE Transactions on Image Processing

Click to view abstract

Abstract:No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.

319. 【2603.13667】TSDCRF: Balancing Privacy and Multi-Object Tracking via Time-Series CRF and Normalized Control Penalty

Link: https://arxiv.org/abs/2603.13667

Authors: Bo Ma, Jinsong Wu, Weiqi Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: typically disrupts cross-frame, sensitive identity information, reveal sensitive identity, disrupts cross-frame association, Normalized Control Penalty

Comments

Click to view abstract

Abstract:Multi-object tracking in video often requires appearance or location cues that can reveal sensitive identity information, while adding privacy-preserving noise typically disrupts cross-frame association and causes ID switches or target loss. We propose TSDCRF, a plug-in refinement framework that balances privacy and tracking by combining three components: (i) $(\varepsilon,\delta)$-differential privacy via calibrated Gaussian noise on sensitive regions under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) that down-weights unstable or conflicting class predictions before noise injection to stabilize association; and (iii) a time-series dynamic conditional random field (DCRF) that enforces temporal consistency and corrects trajectory deviation after noise, mitigating ID switches and improving resilience to trajectory hijacking. The pipeline is agnostic to the choice of detector and tracker (e.g., YOLOv4 and DeepSORT). We evaluate on MOT16, MOT17, Cityscapes, and KITTI. Results show that TSDCRF achieves a better privacy-utility trade-off than white noise and prior methods (NTPD, PPDTSA): lower KL-divergence shift, lower tracking RMSE, and improved robustness under trajectory hijacking while preserving privacy. Source code: this https URL
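Component (i) relies on calibrated Gaussian noise. As a minimal sketch, the classic Gaussian-mechanism calibration sets the noise standard deviation to $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, which is valid for $0 < \varepsilon < 1$. Whether TSDCRF uses exactly this calibration is not stated in the abstract; the function names, the sensitivity value, and the flat list standing in for a "sensitive region" are assumptions.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale of the classic Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize_region(values, epsilon, delta, sensitivity, rng):
    """Add i.i.d. Gaussian noise to every value of a (hypothetical) sensitive region."""
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
region = [0.2, 0.4, 0.6, 0.8]  # stand-in for features of a sensitive region
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, sensitivity=1.0)
noisy_region = privatize_region(region, 0.5, 1e-5, 1.0, rng)
print(f"sigma = {sigma:.2f}")
```

A smaller privacy budget $\varepsilon$ yields a larger $\sigma$, which is the source of the association errors that the NCP and DCRF stages are meant to repair.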

320. 【2603.13660】Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

Link: https://arxiv.org/abs/2603.13660

Authors: Yunhe Gao, Yabin Zhang, Chong Wang, Jiaming Liu, Maya Varma, Jean-Benoit Delbrouck, Akshay Chaudhari, Curtis Langlotz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lacks analogous approaches, imaging lacks analogous, analogous approaches, transformed vision, vision and language

Comments: CVPR 2026

Click to view abstract

Abstract:Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available: this https URL.

321. 【2603.13659】FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures

Link: https://arxiv.org/abs/2603.13659

Authors: Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Viraj Shah, Ramez Hajj

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Segmenting thin structures, Segmenting thin, high annotation costs, thin structures, structures like infrastructure

Comments

Click to view abstract

Abstract:Segmenting thin structures like infrastructure cracks and anatomical vessels is a task hampered by topology-sensitive geometry, high annotation costs, and poor generalization across domains. Existing methods address these challenges in isolation. We propose FMS$^2$, a flow-matching framework with two modules. (1) SegFlow is a 2.96M-parameter segmentation model built on a standard encoder-decoder backbone that recasts prediction as continuous image $\rightarrow$ mask transport. It learns a time-indexed velocity field with a flow-matching regression loss and outputs the mask via ODE integration, rather than supervising only end-state logits. This trajectory-level supervision improves thin-structure continuity and sharpness, compared with tuned topology-aware loss baselines, without auxiliary topology heads, post-processing, or multi-term loss engineering. (2) SynFlow is a mask-conditioned mask $\rightarrow$ image generator that produces pixel-aligned synthetic image-mask pairs. It injects mask geometry at multiple scales and emphasizes boundary bands via edge-aware gating, while a controllable mask generator expands sparsity, width, and branching regimes. On five crack and vessel benchmarks, SegFlow alone outperforms strong CNN, Transformer, Mamba, and generative baselines, improving the volumetric metric (mean IoU) from 0.511 to 0.599 (+17.2%) and reducing the topological metric (Betti matching error) from 82.145 to 51.524 (-37.3%). When training with limited labels, augmenting SegFlow with SynFlow-generated pairs recovers near-full performance using 25% of real annotations and improves cross-domain IoU by 0.11 on average. Unlike classical data augmentation that promotes invariance via label-preserving transforms, SynFlow provides pixel-aligned paired supervision with controllable structural shifts (e.g., sparsity, width, branching), which is particularly effective under domain shift.
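The percentages quoted in the FMS$^2$ abstract are relative changes, not absolute differences. The short check below recomputes them from the four reported values; the helper name is ours, and nothing beyond the numbers in the abstract is assumed.

```python
def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values reported in the abstract.
miou_gain = relative_change_percent(0.511, 0.599)       # mean IoU, higher is better
betti_change = relative_change_percent(82.145, 51.524)  # Betti matching error, lower is better

print(f"mean IoU: {miou_gain:+.1f}%")
print(f"Betti matching error: {betti_change:+.1f}%")
```

The absolute mean-IoU gain is 0.088, which corresponds to the reported +17.2% only when measured relative to the 0.511 baseline.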

322. 【2603.13652】Causal Attribution via Activation Patching

Link: https://arxiv.org/abs/2603.13652

Authors: Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, identify image regions, attributions remains challenging, well-localized attributions remains, individual image patches

Comments

Click to view abstract

Abstract:Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.

323. 【2603.13628】Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models

Link: https://arxiv.org/abs/2603.13628

Authors: Bo Yu, Fengze Yang, Yiming Liu, Chao Wang, Xuewen Luo, Taozhe Li, Ruimin Ke, Xiaofan Zhou, Chenxi Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-Language Models, retrieval-augmented generation, Optimized Locatability Score, emergence of Vision-Language, introduced new paradigms

Comments

Click to view abstract

Abstract:The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.

324. 【2603.13615】Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

Link: https://arxiv.org/abs/2603.13615

Authors: Dayou Li, Lulin Liu, Bangya Liu, Shijie Zhou, Jiu Feng, Ziqi Lu, Minghui Zheng, Chenyu You, Zhiwen Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: scalable data source, video generators relying, future object states, scalable data, data source

Comments

Click to view abstract

Abstract:To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.

325. 【2603.13609】A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

Link: https://arxiv.org/abs/2603.13609

Authors: Mohammad Sahnoon, Merkebe Getachew Demissie, Roberto Souza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structures remain underexplored, remain underexplored, input structures remain, progress in deep, statistical validation

Comments: 16 pages, 7 tables, 10 figures

Click to view abstract

Abstract:Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction. These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
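The grid representation described above can be sketched in a few lines: trips are counted per hour and grid cell to form demand images, a global activity mask marks cells that were ever active, and errors are evaluated only inside that mask. This is an illustrative sketch, not the authors' pipeline; it assumes trips have already been filtered and mapped to `(hour_index, row, col)` tuples, and all function names are hypothetical.

```python
def build_hourly_grids(trips, n_hours, n_rows, n_cols):
    """Aggregate (hour, row, col) trip records into one count grid per hour."""
    grids = [[[0] * n_cols for _ in range(n_rows)] for _ in range(n_hours)]
    for hour, row, col in trips:
        if 0 <= hour < n_hours and 0 <= row < n_rows and 0 <= col < n_cols:
            grids[hour][row][col] += 1
    return grids


def global_activity_mask(grids):
    """Mark cells that saw any demand in at least one hour."""
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    return [
        [any(g[r][c] > 0 for g in grids) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def masked_mse(pred, target, mask):
    """Mean squared error restricted to historically active cells."""
    squared_error, n_active = 0.0, 0
    for r, mask_row in enumerate(mask):
        for c, active in enumerate(mask_row):
            if active:
                squared_error += (pred[r][c] - target[r][c]) ** 2
                n_active += 1
    return squared_error / n_active if n_active else 0.0


# Hypothetical pickups as (hour_index, row, col).
trips = [(0, 0, 0), (0, 0, 0), (1, 1, 2), (2, 0, 0)]
pickup_grids = build_hourly_grids(trips, n_hours=3, n_rows=2, n_cols=3)
mask = global_activity_mask(pickup_grids)
```

Restricting the error to the mask keeps permanently empty cells from dominating the metric, which is the stated purpose of limiting evaluation to historically active areas.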

326. 【2603.13595】A Causal Framework for Mitigating Data Shifts in Healthcare

Link: https://arxiv.org/abs/2603.13595

Authors: Kurt Butler, Stephanie Riley, Damian Machlanski, Edward Moroshko, Panagiotis Dimitrakopoulos, Thomas Melistas, Akchunya Chanchal, Konstantinos Vilouras, Zhihua Liu, Steven McDonagh, Hana Chockler, Ben Glocker, Niccolo Tempini, Matthew Sperrin, Sotirios A Tsaftaris, Ricardo Silva

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse patient populations, medical research, perform reliably, patient populations, populations and heterogeneous

Comments: 21 pages, 3 figures

Click to view abstract

Abstract:Developing predictive models that perform reliably across diverse patient populations and heterogeneous environments is a core aim of medical research. However, generalization is only possible if the learned model is robust to statistical differences between data used for training and data seen at the time and place of deployment. Domain generalization methods provide strategies to address data shifts, but each method comes with its own set of assumptions and trade-offs. To apply these methods in healthcare, we must understand how domain shifts arise, what assumptions we prefer to make, and what our design constraints are. This article proposes a causal framework for the design of predictive models to improve generalization. Causality provides a powerful language to characterize and understand diverse domain shifts, regardless of data modality. This allows us to pinpoint why models fail to generalize, leading to more principled strategies to prepare for and adapt to shifts. We recommend general mitigation strategies, discussing trade-offs and highlighting existing work. Our causality-based perspective offers a critical foundation for developing robust, interpretable, and clinically relevant AI solutions in healthcare, paving the way for reliable real-world deployment.

327. 【2603.13590】Opportunistic Cardiac Health Assessment: Estimating Phenotypes from Localizer MRI through Multi-Modal Representations

Link: https://arxiv.org/abs/2603.13590

Authors: Busra Nur Zeybek, Özgün Turgut, Yundi Zhang, Jiazhen Pan, Robert Graf, Sophie Starck, Daniel Rueckert, Sevgi Gokce Kafali

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Cardiovascular diseases, Cardiovascular, Cardiac, magnetic resonance, localizer MRI

Comments

Click to view abstract

Abstract:Cardiovascular diseases are the leading cause of death. Cardiac phenotypes (CPs), e.g., ejection fraction, are the gold standard for assessing cardiac health, but they are derived from cine cardiac magnetic resonance imaging (CMR), which is costly and requires high spatio-temporal resolution. Every magnetic resonance (MR) examination begins with rapid and coarse localizers for scan planning, which are discarded thereafter. Despite non-diagnostic image quality and lack of temporal information, localizers can provide valuable structural information rapidly. In addition to imaging, patient-level information, including demographics and lifestyle, influences the cardiac health assessment. Electrocardiograms (ECGs) are inexpensive, routinely ordered in clinical practice, and capture the temporal activity of the heart. Here, we introduce C-TRIP (Cardiac Tri-modal Representations for Imaging Phenotypes), a multi-modal framework that aligns localizer MRI, ECG signals, and tabular metadata to learn a robust latent space and predict CPs using localizer images as an opportunistic alternative to CMR. By combining these three modalities, we leverage cheap spatial and temporal information from localizers and ECG, respectively, while benefiting from patient-specific context provided by tabular data. Our pipeline consists of three stages. First, encoders are trained independently to learn uni-modal representations. The second stage fuses the pre-trained encoders to unify the latent space. The final stage uses the enriched representation space for CP prediction, with inference performed exclusively on localizer MRI. The proposed C-TRIP yields accurate functional CPs and high correlations for structural CPs. Since localizers are inherently rapid and low-cost, our C-TRIP framework could enable better accessibility for CP estimation.

328. 【2603.13589】Volumetric Radar Echo Motion Estimation Using Physics-Informed Deep Learning: A Case Study Over Slovakia

Link: https://arxiv.org/abs/2603.13589

Authors: Peter Pavlík, Anna Bou Ezzeddine, Viera Rozinajová

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extrapolation-based methods rely, motion, precipitation systems, horizontal motion, extrapolation-based methods

Comments: To be submitted to a fitting journal

Click to view abstract

Abstract:In precipitation nowcasting, most extrapolation-based methods rely on two-dimensional radar composites to estimate the horizontal motion of precipitation systems. However, in some cases, precipitation systems can exhibit varying motion at different heights. We propose a physics-informed convolutional neural network that estimates independent horizontal motion fields for multiple altitude layers directly from volumetric radar reflectivity data and investigate the practical benefits of altitude-wise motion field estimation for precipitation nowcasting. The model is trained end-to-end on volumetric observations from the Slovak radar network and its extrapolation nowcasting performance is evaluated. We compare the proposed model against an architecturally identical baseline operating on vertically pooled two-dimensional radar composites. Our results show that, although the model successfully learns altitude-wise motion fields, the estimated displacement is highly correlated across vertical levels for the vast majority of precipitation events. Consequently, the volumetric approach does not yield systematic improvements in nowcasting accuracy. While categorical metrics indicate increased precipitation detection at longer lead times, this gain is largely attributable to non-physical artifacts and is accompanied by a growing positive bias. A comprehensive inter-altitude motion field correlation analysis further confirms that events exhibiting meaningful vertical variability in horizontal motion are rare in the studied region. We conclude that, for the Slovak radar dataset, the additional complexity of three-dimensional motion field estimation is not justified by questionable gains in predictive skill. Nonetheless, the proposed framework remains applicable in climates where precipitation systems exhibit stronger vertical variability in horizontal motion.

329. 【2603.13578】LingoMotion: An Interpretable and Unambiguous Symbolic Representation for Human Motion

Link: https://arxiv.org/abs/2603.13578

Authors: Yao Zhang, Zhuchenyang Liu, Yu Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: black-box latent vectors, Existing representations, operate as black-box, black-box latent, latent vectors

Comments

Click to view abstract

Abstract:Existing representations for human motion, such as MotionGPT, often operate as black-box latent vectors with limited interpretability and build on joint positions which can cause ambiguity. Inspired by the hierarchical structure of natural languages - from letters to words, phrases, and sentences - we propose LingoMotion, a motion language that facilitates interpretable and unambiguous symbolic representation for both simple and complex human motion. In this paper, we introduce the concept design of LingoMotion, including the definitions of motion alphabet based on joint angles, the morphology for forming words and phrases to describe simple actions like walking and their attributes like speed and scale, as well as the syntax for describing more complex human activities with sequences of words and phrases. The preliminary results, including the implementation and evaluation of motion alphabet using a large-scale motion dataset Motion-X, demonstrate the high fidelity of motion representation.

330. 【2603.13573】Analytical Logit Scaling for High-Resolution Sea Ice Topology Retrieval from Weakly Labeled SAR Imagery

Link: https://arxiv.org/abs/2603.13573

Authors: Reda Elwaradi, Julien Gimenez, Stéphane Hordoir, Mehdi Ait Hamma, Adrien Chan-Hon-Tong, Flora Weissgerber

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, mapping using Synthetic, crucial for Arctic

Comments

Click to view abstract

Abstract:High-resolution sea ice mapping using Synthetic Aperture Radar (SAR) is crucial for Arctic navigation and climate monitoring. However, operational ice charts provide only coarse, region-level polygons (weak labels), forcing automated segmentation models to struggle with pixel-level accuracy and often yielding under-confident, blurred concentration maps. In this paper, we propose a weakly supervised deep learning pipeline that fuses Sentinel-1 SAR and AMSR-2 radiometry data using a U-Net architecture trained with a region-based loss. To overcome the severe under-confidence caused by weak labels, we introduce an Analytical Logit Scaling method applied post-inference. By dynamically calculating the temperature and bias based on the latent space percentiles (2% and 98%) of each scene, we force a physical binarization of the predictions. This adaptive scaling acts as a topological extractor, successfully revealing fine-grained sea ice fractures (leads) at a 40-meter resolution without requiring any manual pixel-level annotations. Our approach not only resolves local topology but also perfectly preserves regional macroscopic concentrations, achieving a 78% accuracy on highly fragmented summer scenes, thereby bridging the gap between weakly supervised learning and high-resolution physical segmentation.
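
The percentile-based scaling described in this abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the mapping below is an assumption: the 2nd and 98th percentiles of a scene's logits are sent to `-target` and `+target` before the sigmoid, where `target`, `percentile`, and `scale_logits` are hypothetical names introduced only for this illustration.

```python
import math


def percentile(values, q):
    """Linearly interpolated q-th percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def scale_logits(logits, low_q=2.0, high_q=98.0, target=4.0):
    """Rescale one scene's logits so its low/high percentiles map to -target/+target.

    `target` is a hypothetical knob for this sketch, not a value from the paper.
    """
    lo = percentile(logits, low_q)
    hi = percentile(logits, high_q)
    bias = (lo + hi) / 2.0          # centre of the scene's logit range
    half_range = (hi - lo) / 2.0
    if half_range == 0.0:
        # Degenerate scene (all logits equal): nothing to sharpen.
        return [0.5] * len(logits)
    temperature = half_range / target
    return [sigmoid((z - bias) / temperature) for z in logits]


# Under-confident logits clustered around zero.
raw = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]
before = [sigmoid(z) for z in raw]   # all probabilities stay close to 0.5
after = scale_logits(raw)            # extremes are pushed towards 0 and 1
```

Because the temperature and bias are recomputed per scene, the same code adapts to scenes whose raw logits occupy different ranges; whether the paper uses this particular mapping is not stated in the abstract.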

331. 【2603.13571】DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models

Link: https://arxiv.org/abs/2603.13571

Authors: Xiaoqiong Liu, Heng Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained increasing attention, increasing attention owing, enhancing vision foundation, pixel-level understanding tasks, vision foundation models

Comments

Click to view abstract

Abstract:Recently, feature upsampling has gained increasing attention owing to its effectiveness in enhancing vision foundation models (VFMs) for pixel-level understanding tasks. Existing methods typically rely on high-resolution features from the same foundation model to achieve upsampling via self-reconstruction. However, relying solely on intra-model features forces the upsampler to overfit to the source model's inherent location misalignment and high-norm artifacts. To address this fundamental limitation, we propose DiveUp, a novel framework that breaks away from single-model dependency by introducing multi-VFM relational guidance. Instead of naive feature fusion, DiveUp leverages diverse VFMs as a panel of experts, utilizing their structural consensus to regularize the upsampler's learning process, effectively preventing the propagation of inaccurate spatial structures from the source model. To reconcile the unaligned feature spaces across different VFMs, we propose a universal relational feature representation, formulated as a local center-of-mass (COM) field, that extracts intrinsic geometric structures, enabling seamless cross-model interaction. Furthermore, we introduce a spikiness-aware selection strategy that evaluates the spatial reliability of each VFM, effectively filtering out high-norm artifacts to aggregate guidance from only the most reliable expert at each local region. DiveUp is a unified, encoder-agnostic framework; a jointly-trained model can universally upsample features from diverse VFMs without requiring per-model retraining. Extensive experiments demonstrate that DiveUp achieves state-of-the-art performance across various downstream dense prediction tasks, validating the efficacy of multi-expert relational guidance. Our code and models are available at: this https URL

332. 【2603.13557】Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics

Link: https://arxiv.org/abs/2603.13557

Authors: Christof A. Bertram, Jonas Ammeling, Alexander Bartel, Gillian Beamer, Marc Aubreville

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning-based automated, outperform trained pathologists, Deep learning-based, automated image analysis, learning-based automated image

Comments

Click to view abstract

Abstract:Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. Owing to these capabilities, the use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples (diagnostic pathology), regulatory safety assessment (toxicologic pathology), and recurrent research tasks. To ensure that DL-AIA applications are safe and reliable, it is critical to conduct a thorough and objective generalization performance assessment (i.e., the ability of the algorithm to accurately predict patterns of interest) and possibly evaluate model robustness (i.e., the algorithm's capacity to maintain predictive accuracy on images from different sources). In this article, we review the practices for performance assessment in veterinary pathology publications, in which two approaches were identified: 1) Exclusive visual performance control (i.e., eyeballing of algorithmic predictions) plus validation of the model's application utilizing secondary performance indices, and 2) Statistical performance control (alongside the other methods), which requires dataset creation and separation of a hold-out test set prior to model training. This article compares the strengths and weaknesses of statistical and visual performance control methods. Furthermore, we discuss relevant considerations for rigorous statistical performance evaluation including metric selection, test dataset image composition, ground truth label quality, resampling methods such as bootstrapping, statistical comparison of multiple models, and evaluation of model stability. It is our conclusion that visual and statistical evaluation have complementary strengths and a combination of both provides the greatest insight into the DL model's performance and sources of error.

333. 【2603.13556】Semantic Aware Feature Extraction for Enhanced 3D Reconstruction

Link: https://arxiv.org/abs/2603.13556

Authors: Ronald Nap, Andy Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: including simultaneous localization, image stitching, wide-ranging applications, including simultaneous, fundamental problem

Comments

Click to view abstract

Abstract:Feature matching is a fundamental problem in computer vision with wide-ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher-level semantic information. This work proposes a semantic-aware feature extraction framework that employs multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi-floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi-level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.

334. 【2603.13547】NumColor: Precise Numeric Color Control in Text-to-Image Generation

Link: https://arxiv.org/abs/2603.13547

Authors: Muhammad Atif Butt, Diego Hernandez, Alexandra Gomez-Villa, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language descriptions, language descriptions, interpret numerical colors, excel at generating, natural language

Comments

Click to view abstract

Abstract:Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, which enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook containing 6,707 learnable embeddings that map colors to the embedding space of the text encoder in perceptually uniform CIE Lab space. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation. NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.

335. 【2603.13533】SAIF: A Stability-Aware Inference Framework for Medical Image Segmentation with Segment Anything Model

Link: https://arxiv.org/abs/2603.13533

Authors: Ke Wu, Shiqi Chen, Yiheng Zhong, Hengxian Liu, Yingxue Su, Yifang Wang, Junhao Jin, Guangyu Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Segment Anything Model, enable scalable medical, scalable medical image, medical image segmentation, enable scalable

Comments: 11 pages, 3 figures

Click to view abstract

Abstract:The Segment Anything Model (SAM) enables scalable medical image segmentation but suffers from inference-time instability when deployed as a frozen backbone. In practice, bounding-box prompts often contain localization errors, and fixed threshold binarization introduces additional decision uncertainty. These factors jointly cause high prediction variance, especially near object boundaries, degrading reliability. We propose the Stability-Aware Inference Framework (SAIF), a training-free and plug-and-play inference framework that improves robustness by explicitly modeling prompt and threshold uncertainty. SAIF constructs a joint uncertainty space via structured box perturbations and threshold variations, evaluates each hypothesis using decision stability and boundary consistency, and introduces a stability-consistency score to filter unstable candidates and perform stability-weighted fusion in probability space. Experiments on Synapse, CVC-ClinicDB, Kvasir-SEG, and CVC-300 demonstrate that SAIF consistently improves segmentation accuracy and robustness, achieving state-of-the-art performance without retraining or architectural modification. Our anonymous code is released at this https URL.
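
As a rough illustration of the "filter unstable candidates, then fuse in probability space" idea from this abstract, the sketch below scores each candidate probability map and averages the survivors with score-proportional weights. The score used here (fraction of pixels whose binary label is unchanged across a few thresholds) is an assumption standing in for the paper's stability-consistency score, which the abstract does not define; `decision_stability`, `fuse`, and `min_score` are hypothetical names.

```python
def decision_stability(prob_map, thresholds=(0.4, 0.5, 0.6)):
    """Fraction of pixels whose binary label is identical at every threshold.

    A stand-in score for this sketch; the paper's actual score is not given here.
    """
    stable = sum(1 for p in prob_map if len({p >= t for t in thresholds}) == 1)
    return stable / len(prob_map)


def fuse(candidates, min_score=0.5):
    """Drop low-stability candidates, then take a stability-weighted average."""
    if not candidates:
        raise ValueError("at least one candidate probability map is required")
    scored = [(decision_stability(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_score]
    if sum(s for s, _ in kept) == 0:
        # Nothing usable survived the filter: fall back to a plain average.
        kept = [(1.0, c) for _, c in scored]
    total = sum(s for s, _ in kept)
    n_pixels = len(kept[0][1])
    return [sum(s * c[i] for s, c in kept) / total for i in range(n_pixels)]


# Three flattened probability maps, e.g. from three perturbed box prompts.
candidates = [
    [0.90, 0.10, 0.95, 0.05],   # confident, stable
    [0.90, 0.10, 0.90, 0.10],   # confident, stable
    [0.50, 0.50, 0.45, 0.55],   # hovers around the threshold, unstable
]
fused = fuse(candidates)         # the third map is filtered out before fusion
```

Fusing probabilities rather than binary masks keeps the final thresholding step as a single decision, which is the property the abstract highlights.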

336. 【2603.13524】Hide and Seek: Investigating Redundancy in Earth Observation Imagery

Link: https://arxiv.org/abs/2603.13524

Authors: Tasos Papazafeiropoulos, Nikolaos Ioannis Bountos, Nikolas Papadopoulos, Ioannis Papoutsis

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Earth Observation, Computer Vision, driven rapid progress, availability of Earth, advances in Computer

Comments

Click to view abstract

Abstract:The growing availability of Earth Observation (EO) data and recent advances in Computer Vision have driven rapid progress in machine learning for EO, producing domain-specific models at ever-increasing scales. Yet this progress risks overlooking fundamental properties of EO data that distinguish it from other domains. We argue that EO data exhibit a multidimensional redundancy (spectral, temporal, spatial, and semantic) which has a more pronounced impact on the domain and its applications than what current literature reflects. To validate this hypothesis, we conduct a systematic domain-specific investigation examining the existence, consistency, and practical implications of this phenomenon across key dimensions of EO variability. Our findings confirm that redundancy in EO data is both substantial and pervasive: exploiting it yields comparable performance ($\approx98.5\%$ of baseline) at a fraction of the computational cost ($\approx4\times$ fewer GFLOPs), at both training and inference. Crucially, these gains are consistent across tasks, geospatial locations, sensors, ground sampling distances, and architectural designs, suggesting that multi-faceted redundancy is a structural property of EO data rather than an artifact of specific experimental choices. These results lay the groundwork for more efficient, scalable, and accessible large-scale EO models.

337. 【2603.13521】Eleven Primitives and Three Gates: The Universal Structure of Computational Imaging

Link: https://arxiv.org/abs/2603.13521

Authors: Chengshuai Yang, Xin Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hidden structural simplicity, Toggle, Finite Primitive Basis, Primitive Basis Theorem, Computational imaging systems

Comments: 39 pages, 5 figures, 2 extended data tables, supplementary information

Click to view abstract

Abstract:Computational imaging systems -- from coded-aperture cameras to cryo-electron microscopes -- span five carrier families yet share a hidden structural simplicity. We prove that every imaging forward model decomposes into a directed acyclic graph over exactly 11 physically typed primitives (Finite Primitive Basis Theorem) -- a sufficient and minimal basis that provides a compositional language for designing any imaging modality. We further prove that every reconstruction failure has exactly three independent root causes: information deficiency, carrier noise, and operator mismatch (Triad Decomposition). The three gates map to the system lifecycle: Gates 1 and 2 guide design (sampling geometry, carrier selection); Gate 3 governs deployment-stage calibration and drift correction. Validation across 12 modalities and all five carrier families confirms both results, with +0.8 to +13.9 dB recovery on deployed instruments. Together, the 11 primitives and 3 gates establish the first universal grammar for designing, diagnosing, and correcting computational imaging systems.

338. 【2603.13520】A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis

Link: https://arxiv.org/abs/2603.13520

Authors: Alessandro Pesci, Valerio Guarrasi, Marco Alì, Isabella Castiglioni, Paolo Soda

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic resonance imaging, Generative Adversarial Network, Computed tomography, Magnetic resonance, facilitate MRI-only clinical

Comments

Click to view abstract

Abstract:The translation from Magnetic resonance imaging (MRI) to Computed tomography (CT) has been proposed as an effective solution to facilitate MRI-only clinical workflows while limiting exposure to ionizing radiation. Although numerous Generative Adversarial Network (GAN) architectures have been proposed for MRI-to-CT translation, systematic and fair comparisons across heterogeneous models remain limited. We present a comprehensive benchmark of ten GAN architectures evaluated on the SynthRAD2025 dataset across three anatomical districts (abdomen, thorax, head-and-neck). All models were trained under a unified validation protocol with identical preprocessing and optimization settings. Performance was assessed using complementary metrics capturing voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, alongside an analysis of computational complexity. Supervised Paired models consistently outperformed Unpaired approaches, confirming the importance of voxel-wise supervision. Pix2Pix achieved the most balanced performance across districts while maintaining a favorable quality-to-complexity trade-off. Multi-district training improved structural robustness, whereas intra-district training maximized voxel-wise fidelity. This benchmark provides quantitative and computational guidance for model selection in MRI-only radiotherapy workflows and establishes a reproducible framework for future comparative studies. To ensure the reproducibility of our experiments, we make our code public, together with the overall results, at the following link: this https URL

339. 【2603.13507】MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection

Link: https://arxiv.org/abs/2603.13507

Authors: Jinwei Hu, Francesco Borsatti, Arianna Stropeni, Davide Dalle Pezze, Manuel Barusco, Gian Antonio Susto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: performance improves substantially, Model-agnostic Industrial Realistic, Industrial Realistic Anomaly, limited anomalous data, typically trained

Comments

Click to view abstract

Abstract:Industrial visual anomaly detection (VAD) methods are typically trained on normal samples only, yet performance improves substantially when even limited anomalous data is available. Existing anomaly generation approaches either require real anomalous examples, demand expensive hardware, or produce synthetic defects that lack realism. We present MIRAGE (Model-agnostic Industrial Realistic Anomaly Generation and Evaluation), a fully automated pipeline for realistic anomalous image generation and pixel-level mask creation that requires no training and no anomalous images. Our pipeline accesses any generative model as a black box via API calls, uses a VLM for automatic defect prompt generation, and includes a CLIP-based quality filter to retain only well-aligned generated images. For mask generation at scale, we introduce a lightweight, training-free dual-branch semantic change detection module combining text-conditioned Grounding DINO features with fine-grained YOLOv26-Seg structural features. We benchmark four generation methods using Gemini 2.5 Flash Image (Nano Banana) as the generative backbone, evaluating performance on MVTec AD and VisA across two distinct tasks: (i) downstream anomaly segmentation and (ii) visual quality of the generated images, assessed via standard metrics (IS, IC-LPIPS) and a human perceptual study involving 31 participants and 1,550 pairwise votes. The results demonstrate that MIRAGE offers a scalable, accessible foundation for anomaly-aware industrial inspection that requires no real defect data. As a final contribution, we publicly release a large-scale dataset comprising 500 image-mask pairs per category for every MVTec AD and VisA class, over 13,000 pairs in total, alongside all generation prompts and pipeline code.

340. 【2603.13506】LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Link: https://arxiv.org/abs/2603.13506

Authors: Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi, Wei Han, Bingchuan Li, Fangfang Liu, Zhuowei Chen, Tianxiang Ma, Qian HE, Yi Zhou, Xiaohua Xie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted growing attention, growing attention, video generation foundation, advancement of video, attracted growing

Comments

Click to view abstract

Abstract:With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
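
The time-dependent classifier-free guidance mentioned at the end of this abstract can be pictured with a minimal sketch. The abstract does not say how the scale varies with the timestep, so the linear schedule, the endpoint values, and the names `guidance_scale` and `guided_prediction` are all assumptions made for illustration; only the standard classifier-free guidance combination `uncond + w * (cond - uncond)` is a well-known formula.

```python
def guidance_scale(t, w_start=7.5, w_end=2.0):
    """Guidance weight at normalised time t, where t=1 is the noisiest step.

    The linear schedule and both endpoint values are hypothetical.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_end + (w_start - w_end) * t


def guided_prediction(uncond, cond, t):
    """Standard classifier-free guidance with a time-dependent weight."""
    w = guidance_scale(t)
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


# Toy one-dimensional "noise predictions" at an early and a late step.
early = guided_prediction([0.0], [1.0], t=1.0)   # strong guidance while noisy
late = guided_prediction([0.0], [1.0], t=0.0)    # weaker guidance near the end
```

A schedule of this kind lets the conditioning dominate while the coarse layout is being formed and relaxes it later; which direction and shape LibraGen actually uses is not specified in the abstract.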

341. 【2603.13500】ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

Link: https://arxiv.org/abs/2603.13500

Authors: Eric Nazarenus, Chuqiao Li, Yannan He, Xianghui Xie, Jan Eric Lenssen, Gerard Pons-Moll

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: framework that bridges, high-quality offline generation, bridges real-time streaming, unified motion diffusion, motion diffusion framework

Comments: Project page: [this https URL](https://coral79.github.io/ActionPlan/)

Click to view abstract

Abstract:We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25x faster while also achieving 18% motion quality improvement over the best previous method in terms of FID.

342. 【2603.13497】Synthetic Melanoma Image Generation and Evaluation Using Generative Adversarial Networks

Link: https://arxiv.org/abs/2603.13497

Authors: Pei-Yu Lin, Yidan Shen, Neville Mathew, Renjie Hu, Siyu Huang, Courtney M. Queen, Cameron E. West, Ana Ciurea, George Zouridakis

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: improving patient outcomes, skin cancer, patient outcomes, lethal form, form of skin

Comments: 18 pages, 7 figures. Already accepted to MDPI Bioengineering

Click to view abstract

Abstract:Melanoma is the most lethal form of skin cancer, and early detection is critical for improving patient outcomes. Although dermoscopy combined with deep learning has advanced automated skin-lesion analysis, progress is hindered by limited access to large, well-annotated datasets and by severe class imbalance, where melanoma images are substantially underrepresented. To address these challenges, we present the first systematic benchmarking study comparing four GAN architectures, namely DCGAN, StyleGAN2, and two StyleGAN3 variants (T/R), for high-resolution melanoma-specific synthesis. We train and optimize all models on two expert-annotated benchmarks (ISIC 2018 and ISIC 2020) under unified preprocessing and hyperparameter exploration, with particular attention to R1 regularization tuning. Image quality is assessed through a multi-faceted protocol combining distribution-level metrics (FID), sample-level representativeness (FMD), qualitative dermoscopic inspection, downstream classification with a frozen EfficientNet-based melanoma detector, and independent evaluation by two board-certified dermatologists. StyleGAN2 achieves the best balance of quantitative performance and perceptual quality, attaining FID scores of 24.8 (ISIC 2018) and 7.96 (ISIC 2020) at gamma=0.8. The frozen classifier recognizes 83% of StyleGAN2-generated images as melanoma, while dermatologists distinguish synthetic from real images at only 66.5% accuracy (chance = 50%), with low inter-rater agreement (kappa = 0.17). In a controlled augmentation experiment, adding synthetic melanoma images to address class imbalance improved melanoma detection AUC from 0.925 to 0.945 on a held-out real-image test set. These findings demonstrate that StyleGAN2-generated melanoma images preserve diagnostically relevant features and can provide a measurable benefit for mitigating class imbalance in melanoma-focused machine learning pipelines.

343. 【2603.13467】Resolving Interference (RI): Disentangling Models for Improved Model Merging

Link: https://arxiv.org/abs/2603.13467

Authors: Pratik Ramesh, George Stoica, Arun Iyer, Leshem Choshen, Judy Hoffman

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reducing cross-task interference, Cross-Task Interference, shown that multitask, created by directly, directly combining

Abstract:Model merging has shown that multitask models can be created by directly combining the parameters of different models that are each specialized on tasks of interest. However, models trained independently on distinct tasks often exhibit interference that degrades the merged model's performance. To solve this problem, we formally define the notion of Cross-Task Interference as the drift in the representation of the merged model relative to its constituent models. Reducing cross-task interference is key to improving merging performance. To address this issue, we propose our method, Resolving Interference (RI), a light-weight adaptation framework which disentangles expert models to be functionally orthogonal to the space of other tasks, thereby reducing cross-task interference. RI does this whilst using only unlabeled auxiliary data as input (i.e., no task-data is needed), allowing it to be applied in data-scarce scenarios. RI consistently improves the performance of state-of-the-art merging methods by up to 3.8% and generalization to unseen domains by up to 2.3%. We also find RI to be robust to the source of auxiliary input while being significantly less sensitive to tuning of merging hyperparameters. Our codebase is available at: this https URL

344. 【2603.13450】LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models

Link: https://arxiv.org/abs/2603.13450

Authors: Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Discrete Diffusion Language, Diffusion Language Models, Language Models, high inference latency, inference latency arising

Abstract:Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that LADR achieves an approximately 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.

345. 【2603.13438】Draft-and-Target Sampling for Video Generation Policy

Link: https://arxiv.org/abs/2603.13438

Authors: Qikang Zhang, Yingjie Lei, Wei Liu, Daochang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: task conditioned, task description, description and observation, predict the future, future states

Abstract:Video generation models have been used as a robot policy to predict the future states of executing a task conditioned on task description and observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel diffusion inference paradigm for video generation policy that is training-free and can improve inference efficiency. We introduce a self-play denoising approach that uses two complementary denoising trajectories in a single model: draft sampling takes large steps to generate a global trajectory quickly, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method can achieve up to 2.1x speedup and improve the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.

346. 【2603.13437】Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection

Link: https://arxiv.org/abs/2603.13437

Authors: Eman Ouda, Mohammed Salah, Arsenii O. Chulkov, Gianfranco Gargiulo, Gian Luca Tartaglia, Stefano Sfarra, Yusra Abdulrahman

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: output remain largely, remain largely bespoke, thermographic output remain, Active Infrared Thermography, limiting systematic integration

Comments: Submitted to Journal of Cultural Heritage

Abstract:Authenticity and condition assessment are central to conservation decision-making, yet interpretation and reporting of thermographic output remain largely bespoke and expert-dependent, complicating comparison across collections and limiting systematic integration into conservation documentation. Pulsed Active Infrared Thermography (AIRT) is sensitive to subsurface features such as material heterogeneity, voids, and past interventions; however, its broader adoption is constrained by artifact misinterpretation, inter-laboratory variability, and the absence of standardized, explainable reporting frameworks. Although multi-modal thermographic processing techniques are established, their integration with structured natural-language interpretation has not been explored in cultural heritage. A fully automated thermography-vision-language model (VLM) framework is presented. It combines multi-modal AIRT analysis with modality-aware textual reporting, without human intervention during inference. Thermal sequences are processed using Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT), and the resulting anomaly masks are fused into a consensus segmentation that emphasizes regions supported by multiple thermal indicators while mitigating boundary artifacts. The fused evidence is provided to a VLM, which generates structured reports describing the location of the anomaly, thermal behavior, and plausible physical interpretations while explicitly acknowledging the uncertainty and diagnostic limitations. Evaluation on two marquetries demonstrates consistent anomaly detection and stable structured interpretations, indicating reproducibility and generalizability across samples.
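The consensus step described in this abstract (keeping regions that several thermal indicators agree on) can be illustrated with a simple per-pixel vote. This is only a minimal sketch under stated assumptions: the abstract does not specify the fusion rule, so the majority vote, the `consensus_mask` helper, the `min_votes` parameter, and the toy mask values are all hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists of 0/1


def consensus_mask(masks: List[Mask], min_votes: int = 2) -> Mask:
    """Keep a pixel only if at least `min_votes` masks flag it (assumed rule)."""
    if not masks:
        raise ValueError("need at least one mask")
    rows, cols = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all masks must share one shape")
    return [
        [1 if sum(m[r][c] for m in masks) >= min_votes else 0 for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x3 masks standing in for PCT, TSR and PPT outputs (made-up values).
pct = [[1, 1, 0], [0, 0, 0]]
tsr = [[1, 0, 0], [0, 1, 0]]
ppt = [[1, 1, 0], [0, 0, 1]]
fused = consensus_mask([pct, tsr, ppt])
print(fused)
```

Raising `min_votes` makes the fused mask stricter, which in this toy setting drops pixels flagged by a single method only.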

347. 【2603.13435】CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models

Link: https://arxiv.org/abs/2603.13435

Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Han Hu, Lefei Zhang, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: models increasingly exhibit, implicitly capturing temporal, capturing temporal dynamics, increasingly exhibit, properties by implicitly

Abstract:Diffusion-based image-to-video (I2V) models increasingly exhibit world-model-like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory-control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low-dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model's state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white-box and black-box attack settings. Experimental results show that even under low-dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white-box setting and over 80% in the black-box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.
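The phrase "construct a continuous displacement field via temporal integration" can be read as accumulating a velocity over time. The sketch below shows only that generic idea with a forward-Euler sum for a single 2D point; it is not the paper's attack, and `integrate_velocity`, the step size `dt`, and the sample velocities are assumptions made for illustration.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def integrate_velocity(velocities: List[Vec], dt: float = 1.0) -> List[Vec]:
    """Forward-Euler accumulation: d_t = d_{t-1} + v_t * dt, starting from zero."""
    dx, dy = 0.0, 0.0
    displacements = []
    for vx, vy in velocities:
        dx += vx * dt
        dy += vy * dt
        displacements.append((dx, dy))
    return displacements


# Made-up per-frame velocities for one point.
displacements = integrate_velocity([(1.0, 0.0), (1.0, 0.5), (0.5, 0.5)])
print(displacements)
```

Because each displacement is the running sum of earlier velocities, neighboring frames differ by a single velocity step, which is one simple way to keep a perturbation smooth over time.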

348. 【2603.13432】Spatial Transcriptomics as Images for Large-Scale Pretraining

Link: https://arxiv.org/abs/2603.13432

Authors: Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: profiles thousands, tissue sections, precise coordinates, coordinates on tissue, essential for clinical

Abstract:Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
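The "croppable image" idea can be sketched as rasterizing spots onto a grid and cutting fixed-size windows, with one channel per selected gene. The code below is a minimal illustration, assuming spots already sit on integer grid coordinates; the `crop_patch` helper, the spot layout, and the gene names are hypothetical, and the paper's actual gene selection rules are not given in the abstract.

```python
from typing import Dict, List, Tuple

Spot = Tuple[int, int, Dict[str, float]]  # (row, col, gene -> expression)


def crop_patch(
    spots: List[Spot], genes: List[str], top: int, left: int, size: int
) -> List[List[List[float]]]:
    """Return a [len(genes)][size][size] patch; positions without a spot stay 0.0."""
    patch = [[[0.0] * size for _ in range(size)] for _ in genes]
    for row, col, expr in spots:
        r, c = row - top, col - left
        if 0 <= r < size and 0 <= c < size:
            for ch, gene in enumerate(genes):
                patch[ch][r][c] = expr.get(gene, 0.0)
    return patch


# Made-up slide with three spots and two selected genes.
spots = [(0, 0, {"A": 1.0, "B": 2.0}), (1, 2, {"A": 3.0}), (5, 5, {"B": 9.0})]
patch = crop_patch(spots, genes=["A", "B"], top=0, left=0, size=3)
print(patch)
```

Sliding `top` and `left` over one slide yields many fixed-size samples from a single section, which is the sample-count benefit the abstract points to.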

349. 【2603.13429】A Deformable Attention-Based Detection Transformer with Cross-Scale Feature Fusion for Industrial Coil Spring Inspection

Link: https://arxiv.org/abs/2603.13429

Authors: Matteo Rossi, Pony Matt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: substantial scale variations, Automated visual inspection, presents significant challenges, significant challenges due, complex industrial backgrounds

Abstract:Automated visual inspection of locomotive coil springs presents significant challenges due to the morphological diversity of surface defects, substantial scale variations, and complex industrial backgrounds. This paper proposes MSD-DETR (Multi-Scale Deformable Detection Transformer), a novel detection framework that addresses these challenges through three key innovations: (1) a structural re-parameterization strategy that decouples training-time multi-branch topology from inference-time efficiency, enhancing feature extraction while maintaining real-time performance; (2) a deformable attention mechanism that enables content-adaptive spatial sampling, allowing dynamic focus on defect-relevant regions regardless of morphological irregularity; and (3) a cross-scale feature fusion architecture incorporating GSConv modules and VoVGSCSP blocks for effective multi-resolution information aggregation. Comprehensive experiments on a real-world locomotive coil spring dataset demonstrate that MSD-DETR achieves 92.4% mAP@0.5 at 98 FPS, outperforming state-of-the-art detectors including YOLOv8 (+3.1% mAP) and the baseline RT-DETR (+2.8% mAP) while maintaining comparable inference speed, establishing a new benchmark for industrial coil spring quality inspection.

350. 【2603.13427】MIBench: Evaluating LMMs on Multimodal Interaction

Link: https://arxiv.org/abs/2603.13427

Authors: Yu Miao, Zequn Yang, Yake Wei, Ziheng Chen, Haotian Ni, Haodong Duan, Kai Chen, Di Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal, multimodal interaction, integrate and utilize, specific way based, interaction

Comments: 10 pages

Abstract:In different multimodal scenarios, a model needs to integrate and utilize information across modalities in a specific way based on the demands of the task. Different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs), which formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, necessitating that LMMs employ correct forms of multimodal interaction to effectively complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs shows that: (1) LMMs' ability on multimodal interaction remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by textual modalities when processing vision information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect that these observations can serve as a reference for developing LMMs with more enhanced multimodal ability in the future.

351. 【2603.13425】Self-Flow-Matching assisted Full Waveform Inversion

Link: https://arxiv.org/abs/2603.13425

Authors: Xinquan Huang, Paris Perdikaris

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)

Keywords: high-resolution seismic imaging, estimates subsurface velocity, seismic imaging method, recorded waveforms, high-resolution seismic

Abstract:Full-waveform inversion (FWI) is a high-resolution seismic imaging method that estimates subsurface velocity by matching simulated and recorded waveforms. However, FWI is highly nonlinear, prone to cycle skipping, and sensitive to noise, particularly when low frequencies are missing or the initial model is poor, leading to failures under imperfect acquisition. Diffusion-regularized FWI introduces generative priors to encourage geologically realistic models, but these priors typically require costly offline pretraining and can deteriorate under distribution shift. Moreover, they assume Gaussian initialization and a fixed noise schedule, in which it is unclear how to map a deterministic FWI iterate and its starting model to a well-defined diffusion time or noise level. To address these limitations, we introduce Self-Flow-Matching assisted Full-Waveform Inversion (SFM-FWI), a physics-driven framework that eliminates the need for large-scale offline pretraining while avoiding the noise-level alignment ambiguity. SFM-FWI leverages flow matching to learn a transport field without assuming Gaussian initialization or a predefined noise schedule, so the initial model can be used directly as the starting point of the dynamics. Our approach trains a single flow network online using the governing physics and observed data. At each outer iteration, we build an interpolated model and update the flow by backpropagating the FWI data misfit, providing self-supervision without external training pairs. Experiments on challenging synthetic benchmarks show that SFM-FWI delivers more accurate reconstructions, greater noise robustness, and more stable convergence than standard FWI and pretraining-free regularization methods.

352. 【2603.13421】Generalization and Memorization in Rectified Flow

Link: https://arxiv.org/abs/2603.13421

Authors: Mingxing Rao, Daniel Moyer

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Flow Matching objective, Flow Matching, Rectified Flow, high-fidelity image synthesis, Matching objective

Abstract:Generative models based on the Flow Matching objective, particularly Rectified Flow, have emerged as a dominant paradigm for efficient, high-fidelity image synthesis. However, while existing research heavily prioritizes generation quality and architectural scaling, the underlying dynamics of how RF models memorize training data remain largely underexplored. In this paper, we systematically investigate the memorization behaviors of RF through the test statistics of Membership Inference Attacks (MIA). We progressively formulate three test statistics, culminating in a complexity-calibrated metric ($T_\text{mc\_cal}$) that successfully decouples intrinsic image spatial complexity from genuine memorization signals. This calibration yields a significant performance surge, boosting attack AUC by up to 15% and the privacy-critical TPR@1%FPR metric by up to 45%, and establishes the first non-trivial MIA specifically tailored for RF. Leveraging these refined metrics, we uncover a distinct temporal pattern: under standard uniform temporal training, a model's susceptibility to MIA strictly peaks at the integration midpoint, a phenomenon we justify via the network's forced deviation from linear approximations. Finally, we demonstrate that substituting uniform timestep sampling with a Symmetric Exponential (U-shaped) distribution effectively minimizes exposure to vulnerable intermediate timesteps. Extensive evaluations across three datasets confirm that this temporal regularization suppresses memorization while preserving generative fidelity.
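To make the U-shaped timestep idea concrete, the sketch below draws timesteps in [0, 1] that concentrate near both ends and avoid the midpoint. The exact form of the paper's Symmetric Exponential distribution is not stated in the abstract, so the density used here, proportional to exp(-lam * t) + exp(-lam * (1 - t)), the rate `lam`, and the `sample_u_shaped` helper are assumptions chosen for illustration.

```python
import math
import random


def sample_u_shaped(rng: random.Random, lam: float = 4.0) -> float:
    """Sample t in [0, 1] from an assumed density ~ exp(-lam*t) + exp(-lam*(1-t))."""
    # Inverse-CDF draw from an exponential truncated to [0, 1].
    u = rng.random()
    t = -math.log(1.0 - u * (1.0 - math.exp(-lam))) / lam
    # Mirror half of the draws so that mass piles up at both ends.
    return t if rng.random() < 0.5 else 1.0 - t


rng = random.Random(0)
samples = [sample_u_shaped(rng) for _ in range(10000)]
edge = sum(1 for s in samples if s <= 0.2)
middle = sum(1 for s in samples if 0.4 <= s <= 0.6)
print(edge, middle)
```

Larger `lam` pushes more mass toward t = 0 and t = 1, so fewer training steps land on intermediate timesteps.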

353. 【2603.13415】Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation

Link: https://arxiv.org/abs/2603.13415

Authors: Byeongjin Jung, Chanyeong Park, Sejoon Lim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: naturalistic environments, crucial for capturing, capturing the nuanced, human emotions, emotions in naturalistic

Comments: 8 pages

Abstract:Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained Vision-Language models like CLIP have shown remarkable semantic alignment capabilities, their application in continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between semantic space and continuous dimensions. Specifically, we partition the VA space into a 3x3 grid, defining nine emotional regions, each associated with distinct textual descriptions. Rather than a hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground truth coordinates and the region centers, allowing the model to learn fine-grained emotional transitions. For multimodal integration, our architecture utilizes a CLIP image encoder and an Audio Spectrogram Transformer (AST) to extract robust spatial and acoustic features. These features are temporally modeled via Gated Recurrent Units (GRUs) and integrated through a hierarchical fusion scheme that sequentially combines cross-modal attention for alignment and gated fusion for adaptive refinement. Experimental results on the Aff-Wild2 dataset demonstrate that our proposed semantic-guided approach significantly enhances the accuracy of VA estimation, achieving competitive performance in unconstrained "in-the-wild" scenarios.
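The soft-label construction in this abstract (a Gaussian kernel over Euclidean distances to nine region centers) can be sketched in a few lines. The following is a hedged illustration: the value range [-1, 1] for valence and arousal, the placement of the cell centers, the bandwidth `sigma`, and the `soft_labels` helper are assumptions, since the abstract does not give them.

```python
import math
from typing import List

# Assumed: valence and arousal both lie in [-1, 1], so the centers of a
# 3x3 grid sit at -2/3, 0 and 2/3 along each axis.
AXIS = (-2.0 / 3.0, 0.0, 2.0 / 3.0)
CENTERS = [(v, a) for a in AXIS for v in AXIS]  # index = arousal_idx * 3 + valence_idx


def soft_labels(valence: float, arousal: float, sigma: float = 0.5) -> List[float]:
    """Gaussian-kernel weights over the nine regions, normalized to sum to 1."""
    weights = [
        math.exp(-((valence - cv) ** 2 + (arousal - ca) ** 2) / (2.0 * sigma ** 2))
        for cv, ca in CENTERS
    ]
    total = sum(weights)
    return [w / total for w in weights]


neutral = soft_labels(0.0, 0.0)
excited = soft_labels(0.7, 0.7)
print([round(p, 3) for p in neutral])
```

A smaller `sigma` moves the target toward a hard one-hot label, while a larger one spreads probability across neighboring regions.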

354. 【2603.13412】WAT: Online Video Understanding Needs Watching Before Thinking

Link: https://arxiv.org/abs/2603.13412

Authors: Zifan Han, Hongbo Sun, Jinglin Xu, Canhui Tang, Yulong Lei, Xuchong Zhang, Hongbin Sun, Zhongjiang He, Hao Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Abstract:Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
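The two-tier memory in this abstract (a short-term buffer of recent frames plus a fixed-capacity long-term store with redundancy-aware eviction) can be sketched as follows. This is not the WAT implementation: the `FrameMemory` class, the cosine-similarity redundancy score, the rule of evicting the entry most similar to another entry, and the toy feature vectors are all assumptions made for illustration.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FrameMemory:
    """Toy short-term / long-term frame memory (hypothetical design)."""

    def __init__(self, stm_size: int, ltm_size: int) -> None:
        if stm_size < 1 or ltm_size < 1:
            raise ValueError("both capacities must be at least 1")
        self.stm: Deque[List[float]] = deque()
        self.ltm: List[List[float]] = []
        self.stm_size = stm_size
        self.ltm_size = ltm_size

    def add(self, feat: Sequence[float]) -> None:
        self.stm.append(list(feat))
        if len(self.stm) > self.stm_size:
            # The oldest short-term frame graduates to long-term memory.
            self.ltm.append(self.stm.popleft())
            if len(self.ltm) > self.ltm_size:
                self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        # Redundancy of an entry = its highest similarity to any other entry.
        def redundancy(i: int) -> float:
            return max(
                cosine(self.ltm[i], self.ltm[j])
                for j in range(len(self.ltm))
                if j != i
            )

        victim = max(range(len(self.ltm)), key=redundancy)
        del self.ltm[victim]

    def retrieve(self, query: Sequence[float], k: int = 1) -> List[List[float]]:
        # Return the k long-term entries most similar to the query vector.
        return sorted(self.ltm, key=lambda f: cosine(query, f), reverse=True)[:k]


mem = FrameMemory(stm_size=1, ltm_size=2)
for frame in ([1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 1.0]):
    mem.add(frame)
print(mem.ltm, list(mem.stm))
```

In this toy run the two near-duplicate frames compete for one slot, so the long-term store keeps one of them together with the dissimilar frame, which is the "diverse summary" behavior the abstract describes.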

355. 【2603.13410】Bridging the Visual-to-Physical Gap: Physically Aligned Representations for Fall Risk Analysis

Link: https://arxiv.org/abs/2603.13410

Authors: Xianqi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Vision-based fall analysis, key bottleneck remains, Vision-based fall, advanced rapidly, bottleneck remains

Comments: 19 pages, 4 figures

Abstract:Vision-based fall analysis has advanced rapidly, but a key bottleneck remains: visually similar motions can correspond to very different physical outcomes because small differences in contact mechanics and protective responses are hard to infer from appearance alone. Most existing approaches handle this by supervised injury prediction, which depends on reliable injury labels. In practice, such labels are difficult to obtain: video evidence is often ambiguous (occlusion, viewpoint limits), and true injury events are rare and cannot be safely staged, leading to noisy supervision. We address this problem with PHARL (PHysics-aware Alignment Representation Learning), which learns physically meaningful fall representations without requiring clinical outcome labels. PHARL regularizes motion embeddings with two complementary constraints: (1) trajectory-level temporal consistency for stable representation learning, and (2) multi-class physics alignment, where simulation-derived contact outcomes shape embedding geometry. By pairing video windows with temporally aligned simulation descriptors, PHARL captures local impact-relevant dynamics while keeping inference purely feed-forward. Experiments on four public datasets show that PHARL consistently improves risk-aligned representation quality over visual-only baselines while maintaining strong fall-detection performance. Notably, PHARL also exhibits zero-shot ordinality: an interpretable severity structure (Head Trunk Supported) emerges without explicit ordinal supervision.

356. 【2603.13406】Nuanced Emotion Recognition Based on a Segment-based MLLM Framework Leveraging Qwen3-Omni for AH Detection

Link: https://arxiv.org/abs/2603.13406

Authors: Liang Tang, Hongda Li, Jiayu Zhang, Long Chen, Shuxian Li, Siqi Pei, Tiaonan Duan, Yuhao Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Hesitancy holds significant, identifying subtle psychological, subtle psychological states, Ambivalence and Hesitancy, Multimodal Large Language

Comments: 5 pages, 1 figure

Abstract:Emotion recognition in videos is a pivotal task in affective computing, where identifying subtle psychological states such as Ambivalence and Hesitancy holds significant value for behavioral intervention and digital health. Ambivalence and Hesitancy states often manifest through cross-modal inconsistencies such as discrepancies between facial expressions, vocal tones, and textual semantics, posing a substantial challenge for automated recognition. This paper proposes a recognition framework that integrates temporal segment modeling with Multimodal Large Language Models. To address computational efficiency and token constraints in long video processing, we employ a segment-based strategy, partitioning videos into short clips with a maximum duration of 5 seconds. We leverage the Qwen3-Omni-30B-A3B model, fine-tuned on the BAH dataset using LoRA and full-parameter strategies via the MS-Swift framework, enabling the model to synergistically analyze visual and auditory signals. Experimental results demonstrate that the proposed method achieves an accuracy of 85.1% on the test set, significantly outperforming existing benchmarks and validating the superior capability of Multimodal Large Language Models in capturing complex and nuanced emotional conflicts. The code is released at this https URL.

357. 【2603.13405】Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

Link: https://arxiv.org/abs/2603.13405

Authors: Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: requires prompt switching, maintaining perceptual fidelity, video generation requires, subjects or events, extended horizons

Abstract:Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: this https URL

358. 【2603.13403】Diabetic Retinopathy Grading with CLIP-based Ranking-Aware Adaptation: A Comparative Study on Fundus Image

Link: https://arxiv.org/abs/2603.13403

Authors: Sungjun Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated fundus image, fundus image grading, Diabetic retinopathy, hybrid FCN-CLIP model, preventable blindness

Abstract:Diabetic retinopathy (DR) is a leading cause of preventable blindness, and automated fundus image grading can play an important role in large-scale screening. In this work, we investigate three CLIP-based approaches for five-class DR severity grading: (1) a zero-shot baseline using prompt engineering, (2) a hybrid FCN-CLIP model augmented with CBAM attention, and (3) a ranking-aware prompting model that encodes the ordinal structure of DR progression. We train and evaluate on a combined dataset of APTOS 2019 and Messidor-2 (n=5,406), addressing class imbalance through resampling and class-specific optimal thresholding. Our experiments show that the ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) and strong recall on clinically critical severe cases, while the hybrid FCN-CLIP model (92.49%, AUROC 0.99) excels at detecting proliferative DR. Both substantially outperform the zero-shot baseline (55.17%, AUROC 0.75). We analyze the complementary strengths of each approach and discuss their practical implications for screening contexts.

359. 【2603.13402】Event-Driven Video Generation

Link: https://arxiv.org/abs/2603.13402

Authors: Chika Maduabuchi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: support relations break, motion starts, objects drift, drift after placement, fail on simple

Abstract:State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
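The "event-gated sampling (with hysteresis ...)" mentioned in this abstract suggests a gate that opens at one activity level and closes only at a lower one, so it does not flicker around a single threshold. The sketch below shows plain two-threshold hysteresis on a 1D activity signal; the `hysteresis_gate` helper, both thresholds, and the sample scores are assumptions, and how EVD actually applies the gate to tokens and denoising steps is not described in the abstract.

```python
from typing import List, Sequence


def hysteresis_gate(
    activity: Sequence[float], low: float = 0.3, high: float = 0.6
) -> List[bool]:
    """Open the gate when activity reaches `high`; close it only below `low`."""
    gate: List[bool] = []
    active = False
    for a in activity:
        if not active and a >= high:
            active = True
        elif active and a < low:
            active = False
        gate.append(active)
    return gate


# Made-up event-activity scores for one token across successive steps.
scores = [0.1, 0.5, 0.7, 0.5, 0.4, 0.2, 0.5]
gate = hysteresis_gate(scores)
print(gate)
```

Note that the score 0.5 keeps the gate open while an event is active but does not open it on its own, which is the point of using two thresholds.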

360. 【2603.13401】MAD: Microenvironment-Aware Distillation -- A Pretraining Strategy for Virtual Spatial Omics from Microscopy

Link: https://arxiv.org/abs/2603.13401

Authors: Jiashu Han, Kunzan Liu, Yeojin Kim, Saurabh Sinha, Sixian You

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Optics (physics.optics)

Keywords: read molecular states, images-at single-cell resolution, Bridging microscopy, read molecular, molecular states

Comments: 34 pages, 6 figures; under review

Abstract:Bridging microscopy and omics would allow us to read molecular states from images, at single-cell resolution and tissue scale, without the cost and throughput limits of omics technologies. Self-supervised pretraining offers a scalable approach with minimal labels, yet how to encode single-cell identity within tissue environments, and the extent of biological information such models can capture, remains an open question. Here, we introduce MAD (microenvironment-aware distillation), a pretraining strategy that learns cell-centric embeddings by jointly self-distilling the morphology view and the microenvironment view of the same indexed cell into a unified embedding space. Across diverse tissues and imaging modalities, MAD achieves state-of-the-art prediction performance on downstream tasks including cell subtyping, transcriptomic prediction, and bioinformatic inference. MAD even outperforms foundation models with a similar number of model parameters that have been trained on substantially larger datasets. These results demonstrate that MAD's dual-view joint self-distillation effectively captures the complexity and diversity of cells within tissues. Together, this establishes MAD as a general tool for representation learning in microscopy, enabling virtual spatial omics and biological insights from vast microscopy datasets.

361. 【2603.13400】Combining Microscopy Data and Metadata for Reconstruction of Cellular Traction Forces Using a Hybrid Vision Transformer-U-Net

Link: https://arxiv.org/abs/2603.13400

Authors: Yunfei Huang, Elena Van der Vorst, Alexander Richard, Benedikt Sabass

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surrounding extracellular matrix, Traction force microscopy, extracellular matrix, widely used technique, technique for quantifying

Comments: none

Abstract:Traction force microscopy (TFM) is a widely used technique for quantifying the forces that cells exert on their surrounding extracellular matrix. Although deep learning methods have recently been applied to TFM data analysis, several challenges remain, particularly achieving reliable inference across multiple spatial scales and integrating additional contextual information such as cell type to improve accuracy. In this study, we propose ViT+UNet, a robust deep learning architecture that integrates a U-Net with a Vision Transformer. Our results demonstrate that this hybrid model outperforms both standalone U-Net and Vision Transformer architectures in predicting traction force fields. Furthermore, ViT+UNet exhibits superior generalization across diverse spatial scales and varying noise levels, enabling its application to TFM datasets obtained from different experimental setups and imaging systems. By appropriately structuring the input data, our approach also allows the inclusion of metadata, in our case cell-type information, to enhance prediction specificity and accuracy.

362. 【2603.13399】FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving

Link: https://arxiv.org/abs/2603.13399

Authors: Mingzhe Guo, Yixiang Yang, Chuanrong Han, Rufeng Zhang, Shirui Li, Ji Wan, Zhipeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Effective environment modeling, Effective environment, flow, Effective, scene flow

Comments: none

Abstract:Effective environment modeling is the foundation for autonomous driving, underpinning tasks from perception to planning. However, current paradigms often inadequately consider the feedback of ego motion to the observation, which leads to an incomplete understanding of the driving process and consequently limits the planning capability. To address this issue, we introduce a novel ego-scene interactive modeling paradigm. Inspired by human recognition, the paradigm represents ego-scene interaction as the scene flow relative to the ego-vehicle. This conceptualization allows for modeling ego-motion feedback within a feature learning pattern, advantageously utilizing existing log-replay datasets rather than relying on scenario simulations. We specifically propose FlowAD, a general flow-based framework for autonomous driving. Within it, an ego-guided scene partition first constructs basic flow units to quantify scene flow. The ego-vehicle's forward direction and steering velocity directly shape the partition, which reflects ego motion. Then, based on flow units, spatial and temporal flow predictions are performed to model dynamics of scene flow, encompassing both spatial displacement and temporal variation. The final task-aware enhancement exploits learned spatio-temporal flow dynamics to benefit diverse tasks through object and region-level strategies. We also propose a novel Frames before Correct Planning (FCP) metric to assess the scene understanding capability. Experiments in both open and closed-loop evaluations demonstrate FlowAD's generality and effectiveness across perception, end-to-end planning, and VLM analysis. Notably, FlowAD reduces the collision rate by 19% compared with SparseDrive, with an FCP improvement of 1.39 frames (60%) on nuScenes, and achieves a driving score of 51.77 on Bench2Drive. Code, model, and configurations will be released.

363. 【2603.13398】Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Link: https://arxiv.org/abs/2603.13398

Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unifies document parsing, single architecture, key information extraction, document parsing, unifies document

Comments: none

Abstract:We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.

364. 【2603.13397】TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

Link: https://arxiv.org/abs/2603.13397

Authors: Zhaoyu Liu, Xi Weng, Lianyu Hu, Zhe Hou, Kan Jiang, Jin Song Dong, Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating extensive broadcast, extensive broadcast footage, automated coaching, widely followed sports, generating extensive

Comments: none

Abstract:Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics. Our dataset and code are publicly available at this https URL.

365. 【2603.13396】SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation

Link: https://arxiv.org/abs/2603.13396

Authors: Jan Kociszewski, Hubert Jastrzębski, Tymoteusz Stępkowski, Filip Manijak, Krzysztof Rojek, Franziska Boenisch, Adam Dziedzic

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly effective method, intriguingly simple, simple yet highly, highly effective, diffusion models

Comments: Accepted as an ICLR 2026 Poster

Abstract:We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.
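
The headline metric in this abstract is the true positive rate (TPR) at a 1% false positive rate (FPR). A minimal sketch of how that metric is commonly computed from detector scores follows. The function name and the convention that an image is flagged when its score is strictly above the threshold are assumptions for illustration; the paper's evaluation code may differ.

```python
import math


def tpr_at_fpr(watermarked_scores, clean_scores, max_fpr=0.01):
    """Return (tpr, threshold) with the false positive rate held at or below max_fpr.

    Hypothetical helper: an image counts as detected when its score is
    strictly greater than the threshold.
    """
    if not watermarked_scores or not clean_scores:
        raise ValueError("both score lists must be non-empty")
    ranked_clean = sorted(clean_scores, reverse=True)
    # Number of clean images that may be flagged without exceeding max_fpr.
    allowed_false_positives = math.floor(max_fpr * len(ranked_clean) + 1e-9)
    allowed_false_positives = min(allowed_false_positives, len(ranked_clean) - 1)
    # Scores strictly above this value are flagged, so at most
    # allowed_false_positives clean images end up above it.
    threshold = ranked_clean[allowed_false_positives]
    true_positives = sum(score > threshold for score in watermarked_scores)
    return true_positives / len(watermarked_scores), threshold


clean = [float(i) for i in range(100)]       # made-up scores for natural images
marked = [98.5, 99.5, 50.0, 100.0]           # made-up scores for watermarked images
tpr, threshold = tpr_at_fpr(marked, clean, max_fpr=0.01)
```

Fixing the FPR first and reading off the TPR makes detectors comparable at the same false-alarm budget, which is why the metric is reported this way.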

366. 【2603.13395】COT-FM: Cluster-wise Optimal Transport Flow Matching

Link: https://arxiv.org/abs/2603.13395

Authors: Chiensheng Chiang, Kuan-Hsun Tu, Jia-Wei Liao, Cheng-Fu Chou, Tsung-Wei Ke

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Flow Matching, path in Flow, general framework, framework that reshapes, reshapes the probability

Comments: 18 pages, accepted at CVPR 2026

Abstract:We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batchwise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image generation benchmarks, and robotic manipulation tasks.

367. 【2603.13394】Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.13394

Authors: Sihan Cao, Jianwei Zhang, Pengcheng Zheng, Jiaxin Yan, Caiyan Qin, Yalan Ye, Wei Dong, Peng Wang, Yang Yang, Chaoning Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, Large Vision-Language, incur substantial inference, substantial inference costs, inference costs due

Comments: none

Abstract:Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at this https URL.

368. 【2603.13393】Colony Grounded SAM2: Zero-shot detection and segmentation of bacterial colonies using foundation models

Link: https://arxiv.org/abs/2603.13393

Authors: Daan Korporaal, Patrick de Kruijf, Ralph H.G.M. Litjens, Bas H.M. van der Velden

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segment bacterial colonies, propose Colony Grounded, bacterial colonies, images of agar-plates, agar-plates is important

Comments: none

Abstract:The detection and classification of bacterial colonies in images of agar plates is important in microbiology, but is hindered by the lack of labeled datasets. Therefore, we propose Colony Grounded SAM2, a zero-shot inference pipeline to detect and segment bacterial colonies in multiple settings without any further training. By utilizing the pre-trained foundation models Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain, we developed a model that is robust to data changes. Results showed a mean Average Precision of 93.1% and a Dice@detection score of 0.85, showing excellent detection and segmentation capabilities on out-of-distribution datasets. The entire pipeline, including model weights, is shared open access to aid annotation and classification in microbiology.

369. 【2603.13392】Comparative Analysis of Deep Learning Architectures for Multi-Disease Classification of Single-Label Chest X-rays

Link: https://arxiv.org/abs/2603.13392

Authors: Ali M. Bahram, Saman Muhammad Omer, Hardi M. Mohammed

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: cardiac disorders worldwide, primary diagnostic tool, X-ray imaging remains, Chest X-ray imaging, Chest X-ray

Comments: 19 pages, 9 figures, 12 tables. Published in Charmo Journal of Natural Sciences and Technologies (CJNST), 2026

Abstract:Chest X-ray imaging remains the primary diagnostic tool for pulmonary and cardiac disorders worldwide, yet its accuracy is hampered by radiologist shortages and inter-observer variability. This study presents a systematic comparative evaluation of seven deep learning architectures for multi-class chest disease classification: ConvNeXt-Tiny, DenseNet121, DenseNet201, ResNet50, ViT-B/16, EfficientNetV2-M, and MobileNetV2. A balanced dataset of 18,080 chest X-ray images spanning five disease categories (Cardiomegaly, COVID-19, Normal, Pneumonia, and Tuberculosis) was constructed from three public repositories and partitioned at the patient level to prevent data leakage. All models were trained under identical conditions using ImageNet-pretrained weights, standardized preprocessing, and consistent hyperparameters. All seven architectures exceeded 90% test accuracy. ConvNeXt-Tiny achieved the highest performance (92.31% accuracy, 95.70% AUROC), while MobileNetV2 emerged as the most parameter-efficient model (3.5M parameters, 90.42% accuracy, 94.10% AUROC), completing training in 48 minutes. Tuberculosis and COVID-19 classification was near-perfect (AUROC = 99.97%) across all architectures, while Normal, Cardiomegaly, and Pneumonia presented greater challenges due to overlapping radiographic features. Grad-CAM visualizations confirmed clinically consistent attention patterns across disease categories. These findings demonstrate that high-accuracy multi-disease chest X-ray classification is achievable without excessive computational resources, with important implications for AI-assisted diagnosis in both resource-rich and resource-constrained healthcare settings.

ACM classes: I.4.9; I.5.4

Journal reference: Charmo Journal of Natural Sciences and Technologies (CJNST), Vol. 2, Issue 1, pp. 10-28, 2026

Related DOI: https://doi.org/10.31530/cjnst.2026.2.1.2

Submitted: Wed, 11 Mar 2026 07:52:36 UTC (v1), by Ali M. Bahram

370. 【2603.13391】WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

Link: https://arxiv.org/abs/2603.13391

Authors: Yuhong Dai, Yanlin Lai, Mitt Huang, Hangyu Guo, Dingming Li, Hongbo Peng, Haodong Li, Yingxiu Zhao, Haoran Lyu, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: web-generation benchmarks rely, screenshots as input, rely on text, text prompts, prompts or static

Comments: none

Abstract:Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.

371. 【2603.13389】High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Link: https://arxiv.org/abs/2603.13389

Authors: Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent large-scale vision-language, discrete image tokenization, fidelity remains constrained, Recent large-scale, large-scale vision-language models

Comments: none

Abstract:Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLM models to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby preserving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
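
The abstract describes converting image-token logits into continuous, distribution-weighted code vectors with uncertainty features. One plausible reading, sketched below, is a softmax over the logits followed by a probability-weighted average of codebook entries, with Shannon entropy as the uncertainty feature. Both the function name and the choice of entropy are assumptions for illustration; the paper's actual mapping may be defined differently.

```python
import math


def distribution_weighted_code(logits, codebook):
    """Return (expected_code_vector, entropy) for one image-token position.

    Hypothetical helper: `codebook` is a list of code vectors, one per logit.
    Entropy (in nats) stands in for the paper's unspecified uncertainty features.
    """
    if not logits or len(logits) != len(codebook):
        raise ValueError("need one logit per codebook entry")
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    dim = len(codebook[0])
    # Probability-weighted average of the code vectors, one dimension at a time.
    code = [sum(p * entry[d] for p, entry in zip(probs, codebook)) for d in range(dim)]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return code, entropy


toy_codebook = [[1.0, 0.0], [0.0, 1.0]]   # made-up two-entry codebook
code, entropy = distribution_weighted_code([0.0, 0.0], toy_codebook)
```

Compared with taking the arg-max token, the weighted average keeps information about how the probability mass is spread over nearby codes, which is the kind of signal a continuous decoder can condition on.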

372. 【2603.13388】VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition

Link: https://arxiv.org/abs/2603.13388

Authors: Zongqing Li, Zhihui Liu, Yujie Xie, Shansiyuan Wu, Hongshen Lv, Songzhi Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Instruction-based image editing, Instruction-based image, image editing aims, textual instructions, aims to modify

Comments: 26 pages, 21 figures

Abstract:Instruction-based image editing aims to modify source content according to textual instructions. However, existing methods built upon flow matching often struggle to maintain consistency in non-edited regions due to denoising-induced reconstruction errors that cause drift in preserved content. Moreover, they typically lack fine-grained control over edit strength. To address these limitations, we propose VeloEdit, a training-free method that enables highly consistent and continuously controllable editing. VeloEdit dynamically identifies editing regions by quantifying the discrepancy between the velocity fields responsible for preserving source content and those driving the desired edits. Based on this partition, we enforce consistency in preservation regions by substituting the editing velocity with the source-restoring velocity, while enabling continuous modulation of edit intensity in target regions via velocity interpolation. Unlike prior works that rely on complex attention manipulation or auxiliary trainable modules, VeloEdit operates directly on the velocity fields. Extensive experiments on Flux.1 Kontext and Qwen-Image-Edit demonstrate that VeloEdit improves visual consistency and editing continuity with negligible additional computational cost. Code is available at this https URL.
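
The mechanism in this abstract has two parts: regions where the source-preserving and editing velocities nearly agree keep the source velocity, and the remaining regions interpolate between the two to control edit strength. The sketch below shows that idea on flat lists of numbers. The absolute-difference discrepancy, the threshold, and the `strength` parameter are assumptions for illustration; the abstract does not state how the paper measures discrepancy.

```python
def blend_velocities(source_velocity, edit_velocity,
                     discrepancy_threshold=0.1, strength=1.0):
    """Blend two velocity fields element by element.

    Hypothetical helper: elements whose velocities differ by less than
    discrepancy_threshold are treated as preserved and keep the source
    velocity; the rest are linearly interpolated toward the edit velocity.
    """
    if len(source_velocity) != len(edit_velocity):
        raise ValueError("velocity fields must have the same length")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    blended = []
    for v_src, v_edit in zip(source_velocity, edit_velocity):
        if abs(v_edit - v_src) < discrepancy_threshold:
            blended.append(v_src)  # preservation region: follow the source
        else:
            blended.append(v_src + strength * (v_edit - v_src))  # edit region
    return blended


half_edit = blend_velocities([0.0, 0.0, 0.0], [0.05, 1.0, -2.0],
                             discrepancy_threshold=0.1, strength=0.5)
```

Setting `strength` to 0 reproduces the source velocity everywhere, and setting it to 1 applies the full edit only where the two fields disagree.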

373. 【2603.13387】Cylindrical Mechanical Projector for Omnidirectional Fringe Projection Profilometry

Link: https://arxiv.org/abs/2603.13387

Authors: Mincheol Choi, Gaeun Kim, Jae-Sang Hyun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly increased, increased in recent, recent years, reconstruction, Abstract

Comments: none

Abstract:The demand for 360-degree 3D reconstruction has significantly increased in recent years across various domains such as the metaverse and 3D telecommunication. Accordingly, the importance of precise and wide-area 3D sensing technology has become emphasized. While the digital fringe projection method has been widely used due to its high accuracy and implementation flexibility, it suffers from fundamental limitations such as unidirectional projection and a restricted available light spectrum. To address these issues, this paper proposes a novel 3D reconstruction method based on a cylindrical mechanical projector. The proposed method consists of a rotational stage and a cylindrical pattern generator with ON/OFF slots at two distinct intervals, enabling omnidirectional projection of multi-frequency phase-shifted fringe patterns. By applying a multi-wavelength unwrapping algorithm and a quasi-calibration technique, the system achieves high-accuracy 3D reconstruction using only a single camera. Experimental results, supported by repeatability and reproducibility analyses together with a measurement uncertainty evaluation, confirm reliable measurement performance and practical feasibility for omnidirectional 3D reconstruction. The expanded uncertainty of the reconstructed depth was evaluated as 0.215 mm.
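
Phase-shifted fringe patterns are usually decoded with the standard N-step phase-shifting formula, which recovers a wrapped phase per pixel before any unwrapping. The sketch below implements that textbook formula for a single pixel. It assumes equally spaced shifts and the intensity model stated in the docstring; it is not the paper's implementation, and the multi-wavelength unwrapping step is not shown.

```python
import math


def wrapped_phase(intensities):
    """Recover the wrapped phase in [-pi, pi] from N equally shifted samples.

    Assumes sample k follows I_k = A + B * cos(phi - 2*pi*k/N) with N >= 3,
    where A is the background level and B the fringe modulation.
    """
    n = len(intensities)
    if n < 3:
        raise ValueError("need at least three phase-shifted samples")
    sin_sum = sum(i_k * math.sin(2 * math.pi * k / n)
                  for k, i_k in enumerate(intensities))
    cos_sum = sum(i_k * math.cos(2 * math.pi * k / n)
                  for k, i_k in enumerate(intensities))
    # The constant term A cancels in both sums, leaving B*N/2 times sin/cos of phi.
    return math.atan2(sin_sum, cos_sum)


true_phase = 1.0  # made-up phase for the demonstration
samples = [0.5 + 0.4 * math.cos(true_phase - 2 * math.pi * k / 4) for k in range(4)]
recovered = wrapped_phase(samples)
```

Because the result is only known modulo 2π, patterns at a second fringe frequency are needed to resolve the ambiguity, which is the role the abstract assigns to multi-wavelength unwrapping.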

374. 【2603.13386】Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Link: https://arxiv.org/abs/2603.13386

Authors: Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Controllable pathology image, synthesis requires reliable, requires reliable regulation, Controllable pathology, pathology image synthesis

Comments: 19 pages

Abstract:Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.

375. 【2603.13385】VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering

Link: https://arxiv.org/abs/2603.13385

Authors: Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: attacks remains under-evaluated, privacy-critical multimodal scenarios, explicit harmful content, Large Vision-Language Models, semantic visual attacks

Comments: none

Abstract:As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated -- alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite to audit LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images with 8 PII types, validated on 50 in-the-wild (IRL) real-world screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4) with Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern -- where verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models, reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
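
The attack success rates above are reported with Wilson 95% confidence intervals. For readers unfamiliar with the Wilson score interval, a minimal sketch of the standard formula follows. The function name and the example counts are made up for illustration; the abstract does not give the per-condition sample sizes, so none of the paper's numbers are reproduced here.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Return the Wilson score interval (low, high) for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% confidence level.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Made-up example: 50 successful attacks out of 100 attempts.
low, high = wilson_interval(50, 100)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is 0% or 100%, which matters for results such as a mitigation that drives leakage to zero on a small validation set.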

376. 【2603.13383】Taming Vision Priors for Data Efficient mmWave Channel Modeling

Link: https://arxiv.org/abs/2603.13383

Authors: Zhenlin An, Longfei Shangguan, John Kaewell, Philip Pietraski, Jelena Senic, Camillo Gentile, Nada Golmie, Kyle Jamieson

Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)

Keywords: Accurately modeling millimeter-wave, Accurately modeling, modeling millimeter-wave, propagation is essential, autonomous systems

Comments: none

Abstract:Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent with only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios, including office interiors, urban canyons, and dynamic public spaces, show that VisRFTwin reduces channel measurement needs by up to 10× while achieving a 59% lower median delay spread error than pure data-driven deep learning methods.
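
The evaluation metric named here is delay spread error. As background, the RMS delay spread of a multipath channel is the power-weighted standard deviation of the path delays. The sketch below computes it from a list of paths. The function name and the example values are assumptions for illustration; the abstract does not say exactly how the paper derives its error from this quantity.

```python
import math


def rms_delay_spread(delays_s, powers_linear):
    """Return the RMS delay spread in seconds.

    delays_s: path delays in seconds.
    powers_linear: matching path powers on a linear (not dB) scale.
    """
    if not delays_s or len(delays_s) != len(powers_linear):
        raise ValueError("need matching, non-empty delay and power lists")
    total_power = sum(powers_linear)
    if total_power <= 0:
        raise ValueError("total power must be positive")
    mean_delay = sum(p * t for p, t in zip(powers_linear, delays_s)) / total_power
    second_moment = sum(
        p * (t - mean_delay) ** 2 for p, t in zip(powers_linear, delays_s)
    ) / total_power
    return math.sqrt(second_moment)


# Made-up two-path channel: equal-power paths 100 ns apart.
spread = rms_delay_spread([0.0, 100e-9], [1.0, 1.0])
```

For the two equal-power paths in the example, the mean delay is 50 ns and each path sits 50 ns away from it, so the spread is 50 ns.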

377. 【2603.13382】DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1

Link: https://arxiv.org/abs/2603.13382

Authors: Zhenpeng Zhang, Jinwei Lu, Yurui Dong, Bo Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cardiovascular risk stratification, established vascular biomarker, measured from B-mode, Carotid intima-media thickness, Carotid Ultrasound Boundary

Comments: 9 pages, 3 figures

Abstract:Carotid intima-media thickness (CIMT) measured from B-mode ultrasound is an established vascular biomarker for atherosclerosis and cardiovascular risk stratification. Although a wide range of computerized methods have been proposed for carotid boundary delineation and CIMT estimation, robust and transferable deep models that jointly address segmentation and measurement remain underexplored, particularly in the era of vision foundation models. Motivated by recent advances in adapting DINOv3 to medical segmentation and exploiting DINOv3 in test-time optimization pipelines, we investigate a DINOv3-based framework for carotid intima-media complex segmentation and subsequent CIMT measurement on the Carotid Ultrasound Boundary Study (CUBS) v1 dataset. Our pipeline predicts the intima-media band at a fixed image resolution, extracts upper and lower boundaries column-wise, corrects for image resizing using the per-image calibration factor provided by CUBS, and reports CIMT in physical units. Across three patient-level test splits, our method achieved a mean test Dice of 0.7739 ± 0.0037 and IoU of 0.6384 ± 0.0044. The mean CIMT absolute error was 181.16 ± 11.57 µm, with a mean Pearson correlation of 0.480 ± 0.259. In a held-out validation subset (n=28), test-time threshold calibration reduced the mean absolute CIMT error from 141.0 µm at the default threshold to 101.1 µm at the measurement-optimized threshold, while simultaneously reducing systematic bias toward zero. Relative to the error ranges reported in the original CUBS benchmark for classical computerized methods, these results place a DINOv3-based approach within the clinically relevant ~0.1 mm measurement regime. Together, our findings support the feasibility of using vision foundation models for interpretable, calibration-aware CIMT measurement.
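
The measurement step described in the abstract (column-wise boundary extraction, correction for resizing, conversion to physical units) can be sketched as follows. The function name, the mask layout, the inclusive pixel-count convention, and the two calibration parameters are assumptions for illustration; the paper's exact boundary definition and the form of the CUBS calibration factor may differ.

```python
def cimt_mm(mask, mm_per_original_pixel, vertical_resize_scale):
    """Estimate mean intima-media thickness in millimetres from a binary mask.

    Hypothetical helper. `mask` is a list of rows of 0/1 values at the
    network's fixed resolution. `vertical_resize_scale` is the resized height
    divided by the original height, and `mm_per_original_pixel` is the
    per-image calibration factor in millimetres per original pixel.
    Returns None when the mask contains no foreground.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    thicknesses_px = []
    for col in range(n_cols):
        rows = [r for r in range(n_rows) if mask[r][col]]
        if rows:
            # Upper boundary is the first foreground row, lower is the last.
            thicknesses_px.append(rows[-1] - rows[0] + 1)
    if not thicknesses_px:
        return None
    mean_px = sum(thicknesses_px) / len(thicknesses_px)
    # Undo the resize, then convert original pixels to millimetres.
    return mean_px / vertical_resize_scale * mm_per_original_pixel


toy_mask = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
# Made-up calibration: image shrunk to half height, 0.06 mm per original pixel.
thickness = cimt_mm(toy_mask, mm_per_original_pixel=0.06, vertical_resize_scale=0.5)
```

In the toy example the band is 2 pixels thick at the resized resolution, which corresponds to 4 original pixels and therefore 0.24 mm.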

378. 【2603.13377】Deep Learning for BioImaging: What Are We Learning?

Link: https://arxiv.org/abs/2603.13377

Authors: Ivan Svatko, Maxime Sanchez, Ihab Bendidi, Gilles Cottrell, Auguste Genovesio

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: driven major advances, Representation learning, driven major, major advances, analysis by enabling

Comments:

Click to view abstract

Abstract:Representation learning has driven major advances in natural image analysis by enabling models to acquire high-level semantic features. In microscopy imaging, however, it remains unclear what current representation learning methods actually learn. In this work, we conduct a systematic study of representation learning for the two most widely used and broadly available microscopy data types, representing critical scales in biology: cell culture and tissue imaging. To this end, we introduce a set of simple yet revealing baselines on curated benchmarks, including untrained models and simple structural representations of cellular tissue. Our results show that, surprisingly, state-of-the-art methods perform comparably to these baselines. We further show that, in contrast to natural images, existing models fail to consistently acquire high-level, biologically meaningful features. Moreover, we demonstrate that commonly used benchmark metrics are insufficient to assess representation quality and often mask this limitation. In addition, we investigate how detailed comparisons with these benchmarks provide ways to interpret the strengths and weaknesses of models for further improvements. Together, our results suggest that progress in microscopy image representation learning requires not only stronger models, but also more diagnostic benchmarks that measure what is actually learned.

379. 【2603.13376】A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans

Link: https://arxiv.org/abs/2603.13376

Authors: Maximo Rodriguez-Herrero, Dante D. Sanchez-Gallegos, Marco Antonio Núñez-Gaona, Heriberto Aguirre-Meneses, Luis Alberto Villalvazo Gutiérrez, Mario Ibrahin Gutiérrez Velasco, J.L. Gonzalez-Compean, Jesus Carretero

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: primary bone cancer, common primary bone, oldest populations, common primary, affecting the youngest

Comments: 12 pages, Presented at The 2nd workshop about High-Performance e-Science

Click to view abstract

Abstract:Osteosarcoma is the most common primary bone cancer, mainly affecting the youngest and oldest populations. Its detection at early stages is crucial to reduce the probability of developing bone metastasis. In this context, accurate and fast diagnosis is essential to help physicians during the prognosis process. The research goal is to automate the diagnosis of osteosarcoma through a pipeline that includes the preprocessing, detection, postprocessing, and visualization of computed tomography (CT) scans. Thus, this paper presents a machine learning and visualization framework for classifying CT scans using different convolutional neural network (CNN) models. Preprocessing includes data augmentation and identification of the region of interest in scans. Post-processing includes data visualization to render a 3D bone model that highlights the affected area. An evaluation on 12 patients revealed the effectiveness of our framework, obtaining an area under the curve (AUC) of 94.8% and a specificity of 94.6%.

380. 【2603.13375】InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization

Link: https://arxiv.org/abs/2603.13375

Authors: Ronghui Li, Zhongyuan Hu, Li Siyao, Youliang Zhang, Haozhe Xie, Mingyuan Zhang, Jie Guo, Xiu Li, Ziwei Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: controlled scenarios, struggle to generalize, dance, generation methods perform, Foot Restoration Diffusion

Comments: project page: [this https URL](https://infinitedance.github.io/)

Click to view abstract

Abstract:Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation. Code, models, and data will be released.

381. 【2603.13374】Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

Link: https://arxiv.org/abs/2603.13374

Authors: Ali Zia, Usman Ali, Muhammad Umer Ramzan, Hamza Abid, Abdul Rehman, Wei Xiang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: geometry-agnostic feature fusion, existing methods largely, methods largely rely, video anomaly detection, supervised approaches

Comments:

Click to view abstract

Abstract:Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.

382. 【2603.13371】Agentic LLM Workflow for MR Spectroscopy Volume-of-Interest Placements in Brain Tumors

Link: https://arxiv.org/abs/2603.13371

Authors: Sangyoon Lee, Francesca Branzoli, Małgorzata Marjańska, Patrick Bolan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Magnetic resonance spectroscopy, clinically valuable metabolic, valuable metabolic characterization, Magnetic resonance, resonance spectroscopy

Comments: 10 pages, 4 figures

Click to view abstract

Abstract:Magnetic resonance spectroscopy (MRS) provides clinically valuable metabolic characterization of brain tumors, but its utility depends on accurate placement of the spectroscopy volume-of-interest (VOI). However, VOI placement typically has a broad operating window: for a given tumor there are multiple possible VOIs that would lead to high-quality MRS measurements. Thus, a VOI placement can be tuned for clinician preference, case-specific anatomy, and clinical priorities, which leads to high inter-operator variability, especially for heterogeneous tumors. We propose an agentic large language model (LLM) workflow that decomposes VOI placement into generation of diverse candidate VOIs, from which the LLM selects an optimal one based on quantitative metrics. Candidate VOIs are generated by vision transformer-based placement models trained with different objective function preferences, which allows selection from acceptable alternatives rather than a single deterministic placement. On 110 clinical brain tumor cases, the agentic workflow achieves improved solid tumor coverage and necrosis avoidance depending on the user preferences compared to the general-purpose expert placements. Overall, the proposed workflow provides a strategy to adapt VOI placement to different clinical objectives without retraining task-specific models.

383. 【2603.13370】GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Link: https://arxiv.org/abs/2603.13370

Authors: Jiajin Liu, Dongzhe Fan, Chuanhao Ji, Daochen Zha, Qiaoyu Tan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: remains largely underexplored, understanding multimodal signals, demonstrated remarkable capabilities, explicit relational graphs, remains largely

Comments: CVPR 2026

Click to view abstract

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at this https URL.

384. 【2603.13369】Disentangling Prompt Dependence to Evaluate Segmentation Reliability in Gynecological MRI

Link: https://arxiv.org/abs/2603.13369

Authors: Elodie Germani (UR, LTSI), Krystel Nyangoh-Timoh, Pierre Jannin (LTSI), John S H Baxter

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enable generalizable, diverse domains, Promptable segmentation models, Segment Anything Models, Segment

Comments:

Click to view abstract

Abstract:Promptable segmentation models (e.g., the Segment Anything Models) enable generalizable, zero-shot segmentation across diverse domains. Although predictions are deterministic for a fixed image-prompt pair, the robustness of these models to variations in user prompts, referred to as prompt dependence, remains underexplored. In safety-critical workflows with substantial inter-user variability, interpretable and informative frameworks are needed to evaluate prompt dependence. In this work, we assess the reliability of promptable segmentation by analyzing and measuring its sensitivity to prompt variability. We introduce the first formulation of prompt dependence that explicitly disentangles prompt ambiguity (inter-user variability) from local sensitivity (interaction imprecision), offering an interpretable view of segmentation robustness. Experiments on two female pelvic MRI datasets for uterus and bladder segmentation reveal a strong negative correlation between both metrics and segmentation performance, highlighting the value of our framework for assessing robustness. The two metrics have low mutual correlation, supporting the disentangled design of our formulation, and provide meaningful indicators of prompt-related failure modes.

385. 【2603.13368】Real-Time Monocular Scene Analysis for UAV in Outdoor Environments

Link: https://arxiv.org/abs/2603.13368

Authors: Yara AlaaEldin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: leverage monocular cameras, low-altitude unstructured environments, leverage monocular, monocular cameras, robots to predict

Comments:

Click to view abstract

Abstract:In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture, named Co-SemDepth, that can perform the two tasks accurately and rapidly, and validate its effectiveness on a variety of datasets. The training of neural networks requires an abundance of annotated data, and in the UAV field, the availability of such data is limited. We introduce a new synthetic dataset in this thesis, TopAir, which contains images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap. While using synthetic data for training is convenient, it raises issues when shifting to the real domain for testing. We conduct an extensive analytical study to assess the effect of several factors on the synthetic-to-real generalization. Co-SemDepth and TaskPrompter models are used for comparison in this study. The results reveal a superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Also, our analysis allows us to determine which training datasets lead to a better generalization. Moreover, to help attenuate the gap between the synthetic and real domains, image style transfer techniques are explored on aerial images to convert from the synthetic to the realistic style. Cycle-GAN and Diffusion models are employed. The results reveal that diffusion models are better in the synthetic-to-real style transfer. In the end, we focus on the marine domain and address its challenges. Co-SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. The results reveal good generalization performance of Co-SemDepth when tested on real data from the SMD dataset, while further enhancement is needed on the MIT dataset.

386. 【2603.13367】Multimodal Deep Learning for Dynamic and Static Neuroimaging: Integrating MRI and fMRI for Alzheimer Disease Analysis

Link: https://arxiv.org/abs/2603.13367

Authors: Anima Kujur, Zahra Monfared

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, Mild Cognitive Impairment, Normal Cognitive State

Comments:

Click to view abstract

Abstract:Magnetic Resonance Imaging (MRI) provides detailed structural information, while functional MRI (fMRI) captures temporal brain activity. In this work, we present a multimodal deep learning framework that integrates MRI and fMRI for multi-class classification of Alzheimer Disease (AD), Mild Cognitive Impairment, and Normal Cognitive State. Structural features are extracted from MRI using 3D convolutional neural networks, while temporal features are learned from fMRI sequences using recurrent architectures. These representations are fused to enable joint spatial-temporal learning. Experiments were conducted on a small paired MRI-fMRI dataset (29 subjects), both with and without data augmentation. Results show that data augmentation substantially improves classification stability and generalization, particularly for the multimodal 3DCNN-LSTM model. In contrast, augmentation was found to be ineffective for a large-scale single-modality MRI dataset. These findings highlight the importance of dataset size and modality when designing augmentation strategies for neuroimaging-based AD classification.

387. 【2603.13366】Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Link: https://arxiv.org/abs/2603.13366

Authors: Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significantly improved performance, Recent advancements, visual question answering, multimodal large reasoning, question answering

Comments:

Click to view abstract

Abstract:Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
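
The entropy-aware mode switching described in this abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the fixed `entropy_threshold`, and the greedy token choice in the low-entropy branch are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(probs, embedding_table, entropy_threshold):
    """Choose the embedding fed to the next decoding step.

    probs: next-token probabilities, one entry per vocabulary item.
    embedding_table: one embedding vector (list of floats) per vocabulary item.
    entropy_threshold: hypothetical switching point between the two modes.
    """
    dim = len(embedding_table[0])
    if entropy(probs) > entropy_threshold:
        # High entropy: probability-weighted continuous embedding.
        return [
            sum(p * embedding_table[t][d] for t, p in enumerate(probs))
            for d in range(dim)
        ]
    # Low entropy: discrete embedding of the most likely token
    # (greedy selection is an assumption of this sketch).
    best = max(range(len(probs)), key=lambda t: probs[t])
    return list(embedding_table[best])


# Toy vocabulary of two tokens with 2-dimensional embeddings.
toy_table = [[1.0, 0.0], [0.0, 1.0]]
uncertain = next_input_embedding([0.5, 0.5], toy_table, entropy_threshold=0.5)
confident = next_input_embedding([0.99, 0.01], toy_table, entropy_threshold=0.5)
```

In the toy example the uniform distribution exceeds the threshold and yields a blended embedding, while the peaked distribution falls back to a single token embedding.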

388. 【2603.13365】WaveComm: Lightweight Communication for Collaborative Perception via Wavelet Feature Distillation

Link: https://arxiv.org/abs/2603.13365

Authors: Erdemt Bao, Jin Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: information exchange significantly, exchange significantly limits, significantly limits scalability, collaborative sensing systems, multi-agent collaborative sensing

Comments: Accepted by ICRA 2026

Click to view abstract

Abstract:In multi-agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real-time performance, especially in bandwidth-constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet-based communication framework that drastically reduces transmission loads while preserving sensing performance in low-bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using Discrete Wavelet Transform (DWT), transmitting only compact low-frequency components to minimize communication overhead. High-frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi-Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR-V2X datasets for LiDAR-based and camera-based perception tasks demonstrate that WaveComm maintains state-of-the-art performance even when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.
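
The idea of transmitting only the low-frequency wavelet band can be illustrated with a one-level 2D Haar transform. This is a simplified sketch under stated assumptions: the Haar wavelet is chosen here for brevity, and the zero-detail inverse below is only a stand-in for the learned lightweight generator that the abstract describes.

```python
def haar_ll(feature_map):
    """One-level 2D orthonormal Haar transform, low-frequency (LL) band only.

    feature_map: list of rows with even height and width.
    The LL band holds a quarter of the original values.
    """
    h, w = len(feature_map), len(feature_map[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    return [
        [
            (feature_map[i][j] + feature_map[i][j + 1]
             + feature_map[i + 1][j] + feature_map[i + 1][j + 1]) / 2.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def reconstruct_from_ll(ll):
    """Inverse Haar transform with all high-frequency bands set to zero.

    Each LL coefficient expands to a 2x2 block holding the block mean.
    """
    out = []
    for row in ll:
        expanded = []
        for v in row:
            expanded.extend([v / 2.0, v / 2.0])
        out.append(expanded)
        out.append(list(expanded))
    return out


# A piecewise-constant map has no high-frequency content,
# so the LL band alone reconstructs it exactly.
toy_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
sent = haar_ll(toy_map)               # what the sender would transmit
received = reconstruct_from_ll(sent)  # what the receiver rebuilds
```

For feature maps with real high-frequency detail, the zero-detail inverse blurs each 2x2 block to its mean, which is the loss a learned reconstruction would aim to reduce.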

389. 【2603.13364】FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

Link: https://arxiv.org/abs/2603.13364

Authors: Ning Liao, Xiaoxing Wang, Xiaohan Qin, Junchi Yan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: model performance ceases, intermediate dimension exceeds, single-dimension fine-grained design, optimal threshold, limiting further gains

Comments:

Click to view abstract

Abstract:As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.

390. 【2603.13363】IAML: Illumination-Aware Mirror Loss for Progressive Learning in Low-Light Image Enhancement Auto-encoders

Link: https://arxiv.org/abs/2603.13363

Authors: Farida Mohsen, Tala Zaim, Ali Al-Zawqari, Ali Safa, Samir Belhaouari

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Illumination-Aware Mirror Loss, loss function, newly-proposed loss function, loss function termed, low-light image enhancement

Comments:

Click to view abstract

Abstract:This letter presents a novel training approach and loss function for learning low-light image enhancement auto-encoders. Our approach revolves around the use of a teacher-student auto-encoder setup coupled to a progressive learning approach where multi-scale information from clean image decoder feature maps is distilled into each layer of the student decoder in a mirrored fashion using a newly-proposed loss function termed Illumination-Aware Mirror Loss (IAML). IAML helps align the feature maps within the student decoder network with clean feature maps originating from the teacher side while taking into account the effect of lighting variations within the input images. Extensive benchmarking of our proposed approach on three popular low-light image enhancement datasets demonstrates that our model achieves state-of-the-art performance in terms of average SSIM, PSNR and LPIPS reconstruction accuracy metrics. Finally, ablation studies are performed to clearly demonstrate the effect of IAML on the image reconstruction accuracy.

391. 【2603.13361】BrainCast: A Spatio-Temporal Forecasting Model for Whole-Brain fMRI Time Series Prediction

Link: https://arxiv.org/abs/2603.13361

Authors: Yunlong Gao, Jinbo Yang, Li Xiao, Haiye Huo, Yang Ji, Hao Wang, Aiying Zhang, Yu-Ping Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Keywords: Functional magnetic resonance, fMRI time series, time series, magnetic resonance imaging, enables noninvasive investigation

Comments:

Click to view abstract

Abstract:Functional magnetic resonance imaging (fMRI) enables noninvasive investigation of brain function, while short clinical scan durations, arising from human and non-human factors, usually lead to reduced data quality and limited statistical power for neuroimaging research. In this paper, we propose BrainCast, a novel spatio-temporal forecasting framework specifically tailored for whole-brain fMRI time series forecasting, to extend informative fMRI time series without additional data acquisition. It formulates fMRI time series forecasting as a multivariate time series prediction task and jointly models temporal dynamics within regions of interest (ROIs) and spatial interactions across ROIs. Specifically, BrainCast integrates a Spatial Interaction Awareness module to characterize inter-ROI dependencies via embedding every ROI time series as a token, a Temporal Feature Refinement module to capture intrinsic neural dynamics within each ROI by enhancing both low- and high-energy temporal components of fMRI time series at the ROI level, and a Spatio-temporal Pattern Alignment module to combine spatial and temporal representations for producing informative whole-brain features. Experimental results on resting-state and task fMRI datasets from the Human Connectome Project demonstrate the superiority of BrainCast over state-of-the-art time series forecasting baselines. Moreover, fMRI time series extended by BrainCast improve downstream cognitive ability prediction, highlighting the clinical and neuroscientific impact brought by whole-brain fMRI time series forecasting in scenarios with restricted scan durations.

392. 【2603.13360】Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Link: https://arxiv.org/abs/2603.13360

Authors: Hua Liu, Yanbin Wei, Fei Xing, Tyler Derr, Haoyu Han, Yu Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: recommender systems, social media, traffic networks, real-world systems, common in real-world

Comments:

Click to view abstract

Abstract:Dynamic graphs are common in real-world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the complexity of temporal evolution. They tend to overlook fine-grained variations in temporal interaction order, struggle with dependencies that span long time horizons, and offer limited capability to model pair-specific relational dynamics. To address these challenges, we propose Graph2Video, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of "graph frames". By stacking temporally ordered subgraph frames into a "graph video", Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight and plug-and-play link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight the potential of borrowing spatio-temporal modeling techniques from computer vision as a promising and effective approach for advancing dynamic graph learning.

393. 【2603.13357】Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection

Link: https://arxiv.org/abs/2603.13357

Authors: Patricia L. Suarez, Leo Thomas Ramos, Angel D. Sappa

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: camouflaged object detection, CamoDiffusion framework, framework for camouflaged, object detection, camouflaged object

Comments: 10 pages, 8 tables, 4 figures

Click to view abstract

Abstract:Bi-CamoDiffusion is introduced, an evolution of the CamoDiffusion framework for camouflaged object detection. It integrates edge priors into early-stage embeddings via a parameter-free injection process, which enhances boundary sharpness and prevents structural ambiguity. This is governed by a unified optimization objective that balances spatial accuracy, structural constraints, and uncertainty supervision, allowing the model to capture both the object's global context and its intricate boundary transitions. Evaluations across the CAMO, COD10K, and NC4K benchmarks show that Bi-CamoDiffusion surpasses the baseline, delivering sharper delineation of thin structures and protrusions while also minimizing false positives. Also, our model consistently outperforms existing state-of-the-art methods across all evaluated metrics, including $S_m$, $F_{\beta}^{w}$, $E_m$, and $MAE$, demonstrating a more precise object-background separation and sharper boundary recovery.

394. 【2603.13355】Int3DNet: Scene-Motion Cross Attention Network for 3D Intention Prediction in Mixed Reality

Link: https://arxiv.org/abs/2603.13355

Authors: Taewook Ha, Woojin Cho, Dooyoung Kim, Woontack Woo

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: explicit object-level perception, network that predicts, object-level perception, scene-aware network, explicit object-level

Comments: Accepted as an IEEE TVCG paper at IEEE VR 2026 (journal track)

Click to view abstract

Abstract:We propose Int3DNet, a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, enabling robust human intention prediction without explicit object-level perception. In Mixed Reality (MR), intention prediction is critical as it enables the system to anticipate user actions and respond proactively, reducing interaction delays and ensuring seamless user experiences. Our method employs a cross attention fusion of sparse motion cues and scene point clouds, offering a novel approach that directly interprets the user's spatial intention within the scene. We evaluated Int3DNet on the MoGaze and CIRCLE datasets, which are public datasets for full-body human-scene interactions, showing consistent performance across time horizons of up to 1500 ms and outperforming the baselines, even in diverse and unseen scenes. Moreover, we demonstrate the usability of the proposed method through a demonstration of efficient visual question answering (VQA) based on intention areas. Int3DNet provides reliable 3D intention areas derived from head-hand motion and scene geometry, thus enabling seamless interaction between humans and MR systems through proactive processing of intention areas.

395. 【2603.13354】AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Link: https://arxiv.org/abs/2603.13354

Authors: Hamza Mooraj, George Pantazopoulos, Alessandro Suglia

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: diverse acquisition conditions, Reliable crop disease, Convolutional Neural Networks, detection requires models, disease detection requires

Comments: 11 pages main text, 22 pages total including references and appendix. 6 figures, 10 tables. Code and dataset will be released upon publication

Click to view abstract

Abstract:Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.

396. 【2603.13352】Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

链接https://arxiv.org/abs/2603.13352

作者:Xi Chen,Maojun Zhang,Yu Liu,Shen Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Generalization Semantic Segmentation, Domain Generalization Semantic, Domain Generalization, diverse acquisition conditions, significant performance degradation

备注

点击查看摘要

Abstract:Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.

397. 【2603.13349】MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval

链接https://arxiv.org/abs/2603.13349

作者:Fengbin Zhu,Zijing Cai,Yuzhe Wang,Pengyang Shao,Wenjie Wang,Fuli Feng,Richang Hong,Tat-Seng Chua

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:global document structure, ensure retrieval efficacy, maintaining computational efficiency, Visual Document Retrieval, fine-grained visual details

备注

点击查看摘要

Abstract:Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel framework that employs VLMs as a hierarchical multi-resolution encoder, integrates resolution-level Matryoshka representation learning (RMRL) for effective feature fusion, and applies a semantic-aware hierarchical clustering mechanism for visual token compression. Experiments on two widely used VDR benchmarks show that our MURE framework consistently beats strong baselines. Furthermore, it significantly outperforms ColPali with only 50% of its visual token budget.

398. 【2603.13346】Post Training Quantization for Efficient Dataset Condensation

链接https://arxiv.org/abs/2603.13346

作者:Linh-Tam Tran,Sung-Ho Bae

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:reducing storage requirements, Dataset Condensation, distills knowledge, reducing storage, knowledge from large

备注: AAAI-2026 (Oral)

点击查看摘要

Abstract:Dataset Condensation (DC) distills knowledge from large datasets into smaller ones, accelerating training and reducing storage requirements. However, despite notable progress, prior methods have largely overlooked the potential of quantization for further reducing storage costs. In this paper, we take the first step to explore post-training quantization in dataset condensation, demonstrating its effectiveness in reducing storage size while maintaining representation quality without requiring expensive training cost. However, we find that at extremely low bit-widths (e.g., 2-bit), conventional quantization leads to substantial degradation in representation quality, negatively impacting the networks trained on these data. To address this, we propose a novel patch-based post-training quantization approach that ensures localized quantization with minimal loss of information. To reduce the overhead of quantization parameters, especially for small patch sizes, we employ quantization-aware clustering to identify similar patches and subsequently aggregate them for efficient quantization. Furthermore, we introduce a refinement module to align the distribution between original images and their dequantized counterparts, compensating for quantization errors. Our method is a plug-and-play framework that can be applied to synthetic images generated by various DC methods. Extensive experiments across diverse benchmarks including CIFAR-10/100, Tiny ImageNet, and ImageNet subsets demonstrate that our method consistently outperforms prior works under the same storage constraints. Notably, our method nearly doubles the test accuracy of existing methods at extreme compression regimes (e.g., 26.0% → 54.1% for DM at IPC=1), while operating directly on 2-bit images without additional distillation.
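
To make the idea of localized quantization concrete, here is a minimal sketch of uniform min-max quantization applied per patch versus over a whole image. It is a generic illustration, not the authors' method: the clustering and refinement steps described in the abstract are omitted, and the 4x4 toy image and function names are hypothetical.

```python
def quantize(values, bits):
    """Uniform min-max quantization of a flat list; returns dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # e.g. 3 steps between 4 levels for 2-bit
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_by_patch(image, patch, bits):
    """Quantize each patch x patch block with its own (min, scale) pair."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, patch):
        for c0 in range(0, w, patch):
            coords = [(r, c)
                      for r in range(r0, min(r0 + patch, h))
                      for c in range(c0, min(c0 + patch, w))]
            deq = quantize([image[r][c] for r, c in coords], bits)
            for (r, c), v in zip(coords, deq):
                out[r][c] = v
    return out


def mse(a, b):
    """Mean squared error between two equally sized 2-D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy image: dark left half, bright right half (hypothetical values).
image = [
    [0.00, 0.02, 0.90, 0.92],
    [0.04, 0.06, 0.94, 0.96],
    [0.10, 0.12, 0.80, 0.82],
    [0.14, 0.16, 0.84, 0.86],
]

global_mse = mse(image, quantize_by_patch(image, patch=4, bits=2))  # one range for all pixels
patch_mse = mse(image, quantize_by_patch(image, patch=2, bits=2))   # one range per 2x2 patch
print(global_mse, patch_mse)
```

Each patch needs its own minimum and scale, so smaller patches lower the error but add parameter storage. That is the overhead the abstract says is reduced by aggregating similar patches.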

399. 【2603.13345】DDS-UDA: Dual-Domain Synergy for Unsupervised Domain Adaptation in Joint Segmentation of Optic Disc and Optic Cup

链接https://arxiv.org/abs/2603.13345

作者:Yusong Xiao,Yuxuan Wu,Li Xiao,Gang Qu,Haiye Huo,Yu-Ping Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Convolutional neural networks, achieved exciting performance, Convolutional neural, neural networks, achieved exciting

备注

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have achieved exciting performance in joint segmentation of optic disc and optic cup on single-institution datasets. However, their clinical translation is hindered by two major challenges: limited availability of large-scale, high-quality annotations and performance degradation caused by domain shift during deployment across heterogeneous imaging protocols and acquisition platforms. While unsupervised domain adaptation (UDA) provides a way to mitigate these limitations, most existing approaches do not address cross-domain interference and intra-domain generalization within a unified framework. In this paper, we present the Dual-Domain Synergy UDA (DDS-UDA), a novel UDA framework that comprises two key modules. First, a bi-directional cross-domain consistency regularization module is enforced to mitigate cross-domain interference through feature-level semantic information exchange guided by a coarse-to-fine dynamic mask generator, suppressing noise propagation while preserving structural coherence. Second, a frequency-driven intra-domain pseudo label learning module is used to enhance intra-domain generalization by synthesizing spectral amplitude-mixed supervision signals, which ensures high-fidelity feature alignment across domains. Implemented within a teacher-student architecture, DDS-UDA disentangles domain-specific biases from domain-invariant feature-level representations, thereby achieving robust adaptation to heterogeneous imaging environments. We conduct a comprehensive evaluation of our proposed method on two multi-domain fundus image datasets, demonstrating that it outperforms several existing UDA-based methods and thereby provides an effective approach for optic disc and optic cup segmentation.

400. 【2603.13341】Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

链接https://arxiv.org/abs/2603.13341

作者:Zhenyu Zhang,Yixiong Zou,Yuhua Li,Ruixuan Li,Guangyao Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Source-Free Cross-Domain Few-Shot, limited training data, Cross-Domain Few-Shot Learning, shown promising results, Source-Free Cross-Domain

备注: CVPR 2026

点击查看摘要

Abstract:Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. Code is available at this https URL.

401. 【2603.13340】Complementarity-Supervised Spectral-Band Routing for Multimodal Emotion Recognition

链接https://arxiv.org/abs/2603.13340

作者:Zhexian Huang,Bo Zhao,Hui Ma,Zhishu Liu,Jie Zhang,Ruixin Zhang,Shouhong Ding,Zitong Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal emotion recognition, individual emotional states, recognition fuses cues, understand individual emotional, emotion recognition fuses

备注

点击查看摘要

Abstract:Multimodal emotion recognition fuses cues such as text, video, and audio to understand individual emotional states. Prior methods face two main limitations: mechanically relying on independent unimodal performance, thereby missing genuine complementary contributions, and coarse-grained fusion conflicting with the fine-grained representations required by emotion tasks. As inconsistent information density across heterogeneous modalities hinders inter-modal feature mining, we propose the Complementarity-Supervised Multi-Band Expert Network, named Atsuko, to model fine-grained complementary features via multi-scale band decomposition and expert collaboration. Specifically, we orthogonally decompose each modality's features into high, mid, and low-frequency components. Building upon this band-level routing, we design a modality-level router with a dual-path mechanism for fine-grained cross-band selection and cross-modal fusion. To mitigate shortcut learning from dominant modalities, we propose the Marginal Complementarity Module (MCM) to quantify performance loss when removing each modality via bi-modal comparison. The resulting complementarity distribution provides soft supervision, guiding the router to focus on modalities contributing unique information gains. Extensive experiments show our method achieves superior performance on the CMU-MOSI, CMU-MOSEI, CH-SIMS, CH-SIMSv2, and MIntRec benchmarks.

402. 【2603.13337】MultiSolSegment: Multi-channel segmentation of overlapping features in electroluminescence images of photovoltaic cells

链接https://arxiv.org/abs/2603.13337

作者:Ojas Sanghi(1),Norman Jost(1),Benjamin G. Pierce(2),Emma Cooper(3),Isaiah H. Deane(1),Jennifer L. Braid(1) ((1) Sandia National Laboratories, (2) Case Western Reserve University, (3) University of Colorado, Boulder)

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:machine learning methods, enable large-scale analysis, imaging is widely, machine learning, applied to enable

备注: Published in Solar Energy (Elsevier), Volume 310, 2026

点击查看摘要

Abstract:Electroluminescence (EL) imaging is widely used to detect defects in photovoltaic (PV) modules, and machine learning methods have been applied to enable large-scale analysis of EL images. However, existing methods cannot assign multiple labels to the same pixel, limiting their ability to capture overlapping degradation features. We present a multi-channel U-Net architecture for pixel-level multi-label segmentation of EL images. The model outputs independent probability maps for cracks, busbars, dark areas, and non-cell regions, enabling accurate co-classification of interacting features such as cracks crossing busbars. The model achieved an accuracy of 98% and has been shown to generalize to unseen datasets. This framework offers a scalable, extensible tool for automated PV module inspection, improving defect quantification and lifetime prediction in large-scale PV systems.

403. 【2603.13335】Information-Theoretic Constraints for Continual Vision-Language-Action Alignment

链接https://arxiv.org/abs/2603.13335

作者:Libang Zhao,Qixin Zeng,Hongyin Zhang,Donglin Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:open-ended robotic environments, severe catastrophic forgetting, robotic environments, acquire new skills, catastrophic forgetting

备注

点击查看摘要

Abstract:When deployed in open-ended robotic environments, Vision-Language-Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. However, existing continual learning methods fail to preserve such cross-modal information dependencies. We therefore propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Furthermore, experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.

404. 【2603.13334】Lipschitz-Based Robustness Certification Under Floating-Point Execution

链接https://arxiv.org/abs/2603.13334

作者:Toby Murray

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Programming Languages (cs.PL)

关键词:Sensitivity-based robustness certification, require verifiable guarantees, Sensitivity-based robustness, practical approach, approach for certifying

备注

点击查看摘要

Abstract:Sensitivity-based robustness certification has emerged as a practical approach for certifying neural network robustness, including in settings that require verifiable guarantees. A key advantage of these methods is that certification is performed by concrete numerical computation (rather than symbolic reasoning) and scales efficiently with network size. However, as with the vast majority of prior work on robustness certification and verification, the soundness of these methods is typically proved with respect to a semantic model that assumes exact real arithmetic. In reality deployed neural network implementations execute using floating-point arithmetic. This mismatch creates a semantic gap between certified robustness properties and the behaviour of the executed system. As motivating evidence, we exhibit concrete counterexamples showing that real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers, with discrepancies becoming pronounced at lower-precision formats such as float16. We then develop a formal, compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to the sensitivity of floating-point execution under standard rounding-error models, specialised to feed-forward neural networks with ReLU activations. We derive sound conditions for robustness under floating-point execution, including bounds on certificate degradation and sufficient conditions for the absence of overflow. We formalize the theory and its main soundness results, and implement an executable certifier based on these principles, which we empirically evaluate to demonstrate its practicality.
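
The semantic gap described in this abstract can be illustrated with a toy calculation. The sketch below is not one of the paper's counterexamples: the logit values and the Lipschitz bound are hypothetical, and only the final logits are rounded, whereas a real network accumulates rounding error in every layer. It uses the standard-library `struct` format `"e"` to round to IEEE 754 binary16.

```python
import struct


def to_float16(x):
    """Round a Python float (binary64) to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def certified_radius(margin, lipschitz_bound):
    """Radius within which an L-Lipschitz margin function provably stays positive."""
    return margin / lipschitz_bound if margin > 0 else 0.0


# Hypothetical logits for the top class and the runner-up.
logit_top, logit_runner_up = 2048.9, 2048.0
lipschitz_bound = 3.0  # hypothetical bound on the margin function

# Margin in (near-)real arithmetic: positive, so a nonzero radius is certified.
real_margin = logit_top - logit_runner_up
real_radius = certified_radius(real_margin, lipschitz_bound)

# In binary16 the spacing between representable values near 2048 is 2,
# so both logits round to the same number and the margin vanishes.
fp16_margin = to_float16(logit_top) - to_float16(logit_runner_up)
fp16_radius = certified_radius(fp16_margin, lipschitz_bound)

print(real_margin, real_radius)
print(fp16_margin, fp16_radius)
```

A certificate computed from the real-arithmetic margin would claim robustness within a radius of about 0.3, while the float16 execution cannot even separate the two classes at the unperturbed input.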

405. 【2603.13330】RBF-Solver: A Multistep Sampler for Diffusion Probabilistic Models via Radial Basis Functions

链接https://arxiv.org/abs/2603.13330

作者:Soochul Park,Yeon Ju Lee,SeongJin Yoon,Jiyub Shin,Juhee Lee,Seongwoon Jo

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:outstanding generative fidelity, computationally demanding, widely adopted, outstanding generative, Diffusion probabilistic models

备注: 49 pages, 5 figures, Preprint

点击查看摘要

Abstract:Diffusion probabilistic models (DPMs) are widely adopted for their outstanding generative fidelity, yet their sampling is computationally demanding. Polynomial-based multistep samplers mitigate this cost by accelerating inference; however, despite their theoretical accuracy guarantees, they generate the sampling trajectory according to a predefined scheme, providing no flexibility for further optimization. To address this limitation, we propose RBF-Solver, a multistep diffusion sampler that interpolates model evaluations with Gaussian radial basis functions (RBFs). By leveraging learnable shape parameters in Gaussian RBFs, RBF-Solver explicitly follows optimal sampling trajectories. At first order, it reduces to the Euler method (DDIM). At second order or higher, as the shape parameters approach infinity, RBF-Solver converges to the Adams method, ensuring its compatibility with existing samplers. Owing to the locality of Gaussian RBFs, RBF-Solver maintains high image fidelity even at fourth order or higher, where previous samplers deteriorate. For unconditional generation, RBF-Solver consistently outperforms polynomial-based samplers in the high-NFE regime (NFE = 15). On CIFAR-10 with the Score-SDE model, it achieves an FID of 2.87 with 15 function evaluations and further improves to 2.48 with 40 function evaluations. For conditional ImageNet 256 x 256 generation with the Guided Diffusion model at a guidance scale 8.0, substantial gains are achieved in the low-NFE range (5-10), yielding a 16.12-33.73% reduction in FID relative to polynomial-based samplers.
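
The limiting behavior mentioned in this abstract can be seen in a two-point toy problem. The sketch below is a generic Gaussian RBF interpolation, not RBF-Solver itself: the kernel parameterization `exp(-((x - x_i) / shape)**2)`, in which a larger `shape` means a flatter basis function, is an assumption chosen to match the abstract's wording, and the sample values are hypothetical. Polynomial multistep (Adams-type) methods extrapolate past evaluations with a polynomial, which for two points is a straight line.

```python
import math


def gaussian_rbf_extrapolate(xs, fs, x_new, shape):
    """Fit s(x) = w0*phi(x - xs[0]) + w1*phi(x - xs[1]) through two points, evaluate at x_new."""
    def phi(r):
        return math.exp(-((r / shape) ** 2))

    # Interpolation conditions s(xs[i]) = fs[i] give a 2x2 system [[1, a], [a, 1]] w = f.
    a = phi(xs[0] - xs[1])
    det = 1.0 - a * a
    w0 = (fs[0] - a * fs[1]) / det
    w1 = (fs[1] - a * fs[0]) / det
    return w0 * phi(x_new - xs[0]) + w1 * phi(x_new - xs[1])


# Two past evaluations at x = 0 and x = 1 (hypothetical values).
xs, fs = [0.0, 1.0], [1.0, 2.0]

# Linear (degree-1 polynomial) extrapolation to x = 2.
linear = fs[1] + (fs[1] - fs[0]) * (2.0 - xs[1]) / (xs[1] - xs[0])

wide = gaussian_rbf_extrapolate(xs, fs, 2.0, shape=100.0)  # nearly flat basis functions
narrow = gaussian_rbf_extrapolate(xs, fs, 2.0, shape=0.5)  # highly localized basis functions

print(linear, wide, narrow)
```

With a large shape value the RBF extrapolation is close to the linear one, while a small value gives a very different result. A learnable shape parameter therefore acts as a knob between polynomial-like and localized behavior.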

406. 【2603.13306】Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision

链接https://arxiv.org/abs/2603.13306

作者:Kirill Borodin,Kirill Kondrashov,Nikita Vasiliev,Ksenia Gladkova,Inna Larina,Mikhail Gorodnichev,Grach Mkrtchian

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:CCTV safety monitoring, safety monitoring demands, CCTV safety, detectors combine reliable, monitoring demands anomaly

备注: Published at MDPI Journal of Imaging (see [this https URL](https://www.mdpi.com/2313-433X/11/11/400))

点击查看摘要

Abstract:CCTV safety monitoring demands anomaly detectors combine reliable clip-level accuracy with predictable per-clip latency despite weak supervision. This work investigates compact vision-language models (VLMs) as practical detectors for this regime. A unified evaluation protocol standardizes preprocessing, prompting, dataset splits, metrics, and runtime settings to compare parameter-efficiently adapted compact VLMs against training-free VLM pipelines and weakly supervised baselines. Evaluation spans accuracy, precision, recall, F1, ROC-AUC, and average per-clip latency to jointly quantify detection quality and efficiency. With parameter-efficient adaptation, compact VLMs achieve performance on par with, and in several cases exceeding, established approaches while retaining competitive per-clip latency. Adaptation further reduces prompt sensitivity, producing more consistent behavior across prompt regimes under the shared protocol. These results show that parameter-efficient fine-tuning enables compact VLMs to serve as dependable clip-level anomaly detectors, yielding a favorable accuracy-efficiency trade-off within a transparent and consistent experimental setup.

407. 【2603.13300】Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation

链接https://arxiv.org/abs/2603.13300

作者:Mingyu Kim,Young-Heon Kim,Mijung Park

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:distinct paths, flow models, models have recently, recently been developed, diffusion and flow

备注: ICLR2026 Oral, Code is available at [this https URL](https://github.com/MingyuKim87/SGF)

点击查看摘要

Abstract:Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data-driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy-based negative guidance against unsafe data samples. Furthermore, we leverage control-barrier functions analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high-quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.
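
For readers unfamiliar with Maximum Mean Discrepancy, the quantity can be computed as follows. This is a minimal sketch of the standard biased squared-MMD estimator with a Gaussian kernel. How the paper turns it into a guidance potential is not specified in the abstract, so that step is left out, and the 2-D sample sets and bandwidth are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D feature vectors.
unsafe = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]       # reference set of unsafe samples
near = [(0.05, 0.0), (0.0, 0.05), (-0.05, -0.05)]     # generated samples close to the unsafe set
far = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]            # generated samples far from it

mmd_near = mmd_squared(near, unsafe)
mmd_far = mmd_squared(far, unsafe)
print(mmd_near, mmd_far)
```

A small value means the generated samples are statistically close to the unsafe set, which is the situation negative guidance is meant to steer away from.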

408. 【2603.13273】Spatially Aware Deep Learning for Microclimate Prediction from High-Resolution Geospatial Imagery

链接https://arxiv.org/abs/2603.13273

作者:Idan Sulami,Alon Itzkovitch,Michael R. Kearney,Moni Shahar,Ofir Levy

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:physically based frameworks, based frameworks estimate, frameworks estimate temperature, estimate temperature independently, lateral heat exchange

备注: code and sample data are available at [this https URL](https://github.com/levyofi/Sulami_et_al_2026)

点击查看摘要

Abstract:Microclimate models are essential for linking climate to ecological processes, yet most physically based frameworks estimate temperature independently for each spatial unit and rely on simplified representations of lateral heat exchange. As a result, the spatial scales over which surrounding environmental conditions influence local microclimates remain poorly quantified. Here, we show how remote sensing can help quantify the contribution of spatial context to microclimate temperature predictions. Building on convolutional neural network principles, we designed a task-specific deep neural network and trained a series of models in which the spatial extent of input data was systematically varied. Drone-derived spatial layers and meteorological data were used to predict ground temperature at a focal location, allowing direct assessment of how prediction accuracy changes with increasing spatial context. Our results show that incorporating spatially adjacent information substantially improves prediction accuracy, with diminishing returns beyond spatial extents of approximately 5-7 m. This characteristic scale indicates that ground temperatures are influenced not only by local surface properties, but also by horizontal heat transfer and radiative interactions operating across neighboring microhabitats. The magnitude of spatial effects varied systematically with time of day, microhabitat type, and local environmental characteristics, highlighting context-dependent spatial coupling in microclimate formation. By treating deep learning as a diagnostic tool rather than solely a predictive one, our approach provides a general and transferable method for quantifying spatial dependencies in microclimate models and informing the development of hybrid mechanistic-data-driven approaches that explicitly account for spatial interactions while retaining physical interpretability.

409. 【2603.13261】Deep Convolutional Architectures for EEG Classification: A Comparative Study with Temporal Augmentation and Confidence-Based Voting

链接https://arxiv.org/abs/2603.13261

作者:Aryan Patodiya,Hubert Cecotti

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:remains challenging due, limited data availability, brain-computer interface, plays a key, key role

备注: 14 pages, 8 figures, Recent Trends in Image Processing and Pattern Recognition, Copyright held by Springer CCIS

点击查看摘要

Abstract:Electroencephalography (EEG) classification plays a key role in brain-computer interface (BCI) systems, yet it remains challenging due to the low signal-to-noise ratio, temporal variability of neural responses, and limited data availability. In this paper, we present a comparative study of deep learning architectures for classifying event-related potentials (ERPs) in EEG signals. The preprocessing pipeline includes bandpass filtering, spatial filtering, and normalization. We design and compare three main pipelines: a 2D convolutional neural network (CNN) using Common Spatial Pattern (CSP), a second 2D CNN trained directly on raw data for a fair comparison, and a 3D CNN that jointly models spatiotemporal representations. To address ERP latency variations, we introduce a temporal shift augmentation strategy during training. At inference time, we employ a confidence-based test-time voting mechanism to improve prediction stability across shifted trials. An experimental evaluation on a stratified five-fold cross-validation protocol demonstrates that while CSP provides a benefit to the 2D architecture, the proposed 3D CNN significantly outperforms both 2D variants in terms of AUC and balanced accuracy. These findings highlight the effectiveness of temporal-aware architectures and augmentation strategies for robust EEG signal classification.
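
The shift-based augmentation and test-time voting described here can be sketched in a few lines. The abstract does not give the exact voting rule, so choosing the shifted copy whose probability lies farthest from 0.5 is an assumption. `toy_model` is a hypothetical stand-in for a trained CNN that only responds when the ERP peak falls at a fixed sample index.

```python
import math


def shift_signal(signal, offset):
    """Shift a 1-D signal by `offset` samples, padding with the edge value."""
    n = len(signal)
    if offset > 0:  # delay: the response appears later
        return [signal[0]] * offset + signal[: n - offset]
    if offset < 0:  # advance: the response appears earlier
        return signal[-offset:] + [signal[-1]] * (-offset)
    return list(signal)


def vote_by_confidence(model, signal, offsets):
    """Score every shifted copy and keep the most confident prediction."""
    probs = [model(shift_signal(signal, k)) for k in offsets]
    best = max(probs, key=lambda p: abs(p - 0.5))
    return (1 if best >= 0.5 else 0), best


def toy_model(signal):
    """Hypothetical classifier: fires only if the peak sits at sample index 3."""
    return 1.0 / (1.0 + math.exp(-(8.0 * signal[3] - 2.0)))


# A trial whose peak arrives two samples late (index 5 instead of 3).
trial = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

unshifted_prob = toy_model(trial)
label, confidence = vote_by_confidence(toy_model, trial, offsets=[-2, -1, 0, 1, 2])
print(unshifted_prob, label, confidence)
```

During training, the same `shift_signal` helper could be applied with randomly drawn offsets so that the model sees latency variations directly.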

410. 【2603.13240】Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field

链接https://arxiv.org/abs/2603.13240

作者:Ozge Mercanoglu Sincan,Jian He Low,Sobhan Asasi,Richard Bowden

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Sign Language Translation, convert visual sign, visual sign language, automatically convert visual, spoken language text

备注: This is a preprint of an article published in Computer Vision and Image Understanding (CVIU)

点击查看摘要

Abstract:Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (this https URL) to support transparency and reproducibility in SLT research.

Journal reference: Computer Vision and Image Understanding, vol. 261, p. 104498, 2025

Related DOI: https://doi.org/10.1016/j.cviu.2025.104498

411. 【2603.13238】KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

链接https://arxiv.org/abs/2603.13238

作者:Henry Gagnier,Sophie Gagnier,Ashwin Kirubakaran

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Arabic script OCR, optical character recognition, Arabic script, OCR, Arabic

备注: Accepted to AbjadNLP @ EACL 2026

点击查看摘要

Abstract:Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.

412. 【2603.10165】OpenClaw-RL: Train Any Agent Simply by Talking

Link: https://arxiv.org/abs/2603.10165

Authors: Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: online learning source, GUI state change, tool output, online learning, learning source

Comments: Code: [this https URL](https://github.com/Gen-Verse/OpenClaw-RL)

Click to view abstract

Abstract:Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL

413. 【2602.20409】CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Link: https://arxiv.org/abs/2602.20409

Authors: Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent vision-language models, impressive cross-modal reasoning, demonstrate impressive cross-modal, Recent vision-language, CLIP demonstrate impressive

Comments: Accepted in CVPR 2026

Click to view abstract

Abstract:Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at this https URL.

414. 【2601.17468】ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation

Link: https://arxiv.org/abs/2601.17468

Authors: Chia-Ming Lee, Yu-Fan Lin, Jing-Hui Jung, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Single Image Reflection, disentangles mixed images, Image Reflection Separation, Single Image, Image Reflection

Comments: Project page: [this https URL](https://wuw2135.github.io/ReflexSplit-ProjectPage/)

Click to view abstract

Abstract:Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations. (1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup. Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization. Our code is available at this https URL.

415. 【2511.06694】ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware

Link: https://arxiv.org/abs/2511.06694

Authors: Jose Marie Antonio Minoza, Rex Gregor Laylo, Christian F Villarin, Sebastian C. Ibanez

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)

Keywords: remains poorly quantified, impact remains poorly, environmental impact remains, massive scale, poorly quantified

Comments

Click to view abstract

Abstract:Machine learning inference occurs at a massive scale, yet its environmental impact remains poorly quantified, especially on low-resource hardware. We present ML-EcoLyzer, a cross-framework tool for measuring the carbon, energy, thermal, and water costs of inference across CPUs, consumer GPUs, and datacenter accelerators. The tool supports both classical and modern models, applying adaptive monitoring and hardware-aware evaluation. We introduce the Environmental Sustainability Score (ESS), which quantifies the number of effective parameters served per gram of CO$_2$ emitted. Our evaluation covers over 1,900 inference configurations, spanning diverse model architectures, task modalities (text, vision, audio, tabular), hardware types, and precision levels. These rigorous and reliable measurements demonstrate that quantization enhances ESS, huge accelerators can be inefficient for lightweight applications, and even small models may incur significant costs when implemented suboptimally. ML-EcoLyzer sets a standard for sustainability-conscious model selection and offers an extensive empirical evaluation of environmental costs during inference.
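
The abstract describes ESS as effective parameters served per gram of CO$_2$. A rough sketch of that ratio follows, assuming emissions are estimated as energy use times grid carbon intensity. How the tool itself counts "effective parameters" or measures emissions is not stated in the abstract, so both function names and inputs here are hypothetical:

```python
def co2_grams(energy_kwh: float, carbon_intensity_g_per_kwh: float) -> float:
    """Estimated emissions in grams of CO2 (assumed: energy x grid carbon intensity)."""
    return energy_kwh * carbon_intensity_g_per_kwh


def environmental_sustainability_score(effective_params: float, grams_co2: float) -> float:
    """Effective parameters served per gram of CO2; higher means more efficient."""
    if grams_co2 <= 0:
        raise ValueError("grams_co2 must be positive")
    return effective_params / grams_co2


# Illustrative numbers only, not measurements from the paper.
emitted = co2_grams(energy_kwh=0.5, carbon_intensity_g_per_kwh=400.0)
score = environmental_sustainability_score(effective_params=1e9, grams_co2=emitted)
```

Under this reading, anything that lowers the energy drawn for the same served parameters raises the score, which is consistent with the abstract's remark that quantization enhances ESS.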

Journal reference: Association for the Advancement of Artificial Intelligence (2026). AI for Environmental Science

416. 【2504.14372】Learning Enhanced Structural Representations with Block-Based Uncertainties for Ocean Floor Mapping

Link: https://arxiv.org/abs/2504.14372

Authors: Jose Marie Antonio Minoza

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: current worldwide datasets, exact numerical simulations, Accurate ocean modeling, Accurate ocean, current worldwide

Comments

Click to view abstract

Abstract:Accurate ocean modeling and coastal hazard prediction depend on high-resolution bathymetric data; yet, current worldwide datasets are too coarse for exact numerical simulations. While recent deep learning advances have improved earth observation data resolution, existing methods struggle with the unique challenges of producing detailed ocean floor maps, especially in maintaining physical structure consistency and quantifying uncertainties. This work presents a novel uncertainty-aware mechanism using spatial blocks to efficiently capture local bathymetric complexity based on block-based conformal prediction. Using the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, the integration of this uncertainty quantification framework yields spatially adaptive confidence estimates while preserving topographical features via discrete latent representations. With smaller uncertainty widths in well-characterized areas and appropriately larger bounds in areas of complex seafloor structures, the block-based design adapts uncertainty estimates to local bathymetric complexity. Compared to conventional techniques, experimental results over several ocean regions show notable increases in both reconstruction quality and uncertainty estimation reliability. This framework increases the reliability of bathymetric reconstructions by preserving structural integrity while offering spatially adaptive uncertainty estimates, thereby paving the way for more robust climate modeling and coastal hazard assessment.
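
To illustrate the idea of block-wise uncertainty widths, here is a minimal sketch of split conformal prediction applied per spatial block. It assumes absolute residuals as the nonconformity score and a single calibration grid; the paper's actual scoring rule and block layout are not given in the abstract, and the function names are hypothetical:

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def block_widths(true_grid, pred_grid, block, alpha=0.1):
    """Interval half-width per block, from absolute residuals on calibration data."""
    rows, cols = len(true_grid), len(true_grid[0])
    widths = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            scores = [abs(true_grid[r][c] - pred_grid[r][c])
                      for r in range(r0, min(r0 + block, rows))
                      for c in range(c0, min(c0 + block, cols))]
            widths[(r0 // block, c0 // block)] = conformal_quantile(scores, alpha)
    return widths
```

A block whose calibration residuals are large receives a wide interval, while a smooth, well-predicted block receives a narrow one, matching the spatially adaptive behavior the abstract describes.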

417. 【2603.15582】Benchmarking Machine Learning Approaches for Polarization Mapping in Ferroelectrics Using 4D-STEM

Link: https://arxiv.org/abs/2603.15582

Authors: Matej Martinc, Goran Dražič, Anton Kokalj, Katarina Žiberna, Janina Roknić, Matic Poberžnik, Sašo Džeroski, Andreja Benčan Golob

Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Four-dimensional scanning transmission, Four-dimensional scanning, scanning transmission electron, atomic-scale insights, scanning transmission

Comments

Click to view abstract

Abstract:Four-dimensional scanning transmission electron microscopy (4D-STEM) provides rich, atomic-scale insights into materials structures. However, extracting specific physical properties - such as polarization directions essential for understanding functional properties of ferroelectrics - remains a significant challenge. In this study, we systematically benchmark multiple machine learning models, namely ResNet, VGG, a custom convolutional neural network, and PCA-informed k-Nearest Neighbors, to automate the detection of polarization directions from 4D-STEM diffraction patterns in ferroelectric potassium sodium niobate. While models trained on synthetic data achieve high accuracy on idealized synthetic diffraction patterns of equivalent thickness, the domain gap between simulation and experiment remains a critical barrier to real-world deployment. In this context, a custom-made prototype representation training regime and PCA-based methods, combined with data augmentation and filtering, can better bridge this gap. Error analysis reveals periodic misclassification patterns, indicating that not all diffraction patterns carry enough information for a successful classification. Additionally, our qualitative analysis demonstrates that irregularities in the model's prediction patterns correlate with defects in the crystal structure, suggesting that supervised models could be used for detecting structural defects. These findings guide the development of robust, transferable machine learning tools for electron microscopy analysis.

418. 【2603.15143】Clinical Priors Guided Lung Disease Detection in 3D CT Scans

Link: https://arxiv.org/abs/2603.15143

Authors: Kejin Lu, Jianfa Bai, Qingqiu Li, Runtian Yuan, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer-aided diagnosis systems, Accurate classification, diagnosis systems, plays an important, important role

Comments

Click to view abstract

Abstract:Accurate classification of lung diseases from chest CT scans plays an important role in computer-aided diagnosis systems. However, medical imaging datasets often suffer from severe class imbalance, which may significantly degrade the performance of deep learning models, especially for minority disease categories. To address this issue, we propose a gender-aware two-stage lung disease classification framework. The proposed approach explicitly incorporates gender information into the disease recognition pipeline. In the first stage, a gender classifier is trained to predict the patient's gender from CT scans. In the second stage, the input CT image is routed to a corresponding gender-specific disease classifier to perform final disease prediction. This design enables the model to better capture gender-related imaging characteristics and alleviate the influence of imbalanced data distribution. Experimental results demonstrate that the proposed method improves the recognition performance for minority disease categories, particularly squamous cell carcinoma, while maintaining competitive performance on other classes.
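
The two-stage routing described in the abstract can be sketched as follows. This is only a schematic under stated assumptions: the classifiers are passed in as plain callables, and the toy stand-ins and their labels are hypothetical placeholders rather than the paper's models or classes:

```python
from typing import Callable, Dict, Sequence

Volume = Sequence[float]  # stand-in for a 3D CT volume (assumption)


def two_stage_predict(
    volume: Volume,
    gender_classifier: Callable[[Volume], str],
    disease_classifiers: Dict[str, Callable[[Volume], str]],
) -> str:
    """Route a scan through a gender classifier, then a gender-specific disease classifier."""
    gender = gender_classifier(volume)          # stage 1: predict gender from the scan
    return disease_classifiers[gender](volume)  # stage 2: gender-specific prediction


# Toy stand-ins; real models would be trained networks.
def toy_gender(volume: Volume) -> str:
    return "female" if sum(volume) > 0 else "male"


toy_disease = {
    "female": lambda volume: "label_from_female_model",
    "male": lambda volume: "label_from_male_model",
}
```

One consequence of this design is that an error in stage 1 sends the scan to the wrong disease classifier, so the accuracy of the gender classifier bounds what the second stage can achieve.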

419. 【2603.14644】LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

Link: https://arxiv.org/abs/2603.14644

Authors: Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Machine Learning (cs.LG)

Keywords: datasets remain limited, Publicly available full-field, multi-vendor FFDM dataset, full-field digital mammography, limited in size

Comments: This paper was accepted to CVPR 2026

Click to view abstract

Abstract:Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method ("energy harmonization") that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.

420. 【2603.13967】EchoLVFM: One-Step Video Generation via Latent Flow Matching for Echocardiogram Synthesis

Link: https://arxiv.org/abs/2603.13967

Authors: Emmanuel Oladokun, Sarina Thomas, Jurica Šprem, Vicente Grau

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: assessing cardiac function, left-ventricular ejection fraction, clinically meaningful parameters, Echocardiography is widely, cardiac function

Comments: Submitted to MICCAI 2026; Under Review

Click to view abstract

Abstract:Echocardiography is widely used for assessing cardiac function, where clinically meaningful parameters such as left-ventricular ejection fraction (EF) play a central role in diagnosis and management. Generative models capable of synthesising realistic echocardiogram videos with explicit control over such parameters are valuable for data augmentation, counterfactual analysis, and specialist training. However, existing approaches typically rely on computationally expensive multi-step sampling and aggressive temporal normalisation, limiting efficiency and applicability to heterogeneous real-world data. We introduce EchoLVFM, a one-step latent video flow-matching framework for controllable echocardiogram generation. Operating in the latent space, EchoLVFM synthesises temporally coherent videos in a single inference step, achieving a $\mathbf{\sim 50\times}$ improvement in sampling efficiency compared to multi-step flow baselines while maintaining visual fidelity. The model supports global conditioning on clinical variables, demonstrated through precise control of EF, and enables reconstruction and counterfactual generation from partially observed sequences. A masked conditioning strategy further removes fixed-length constraints, allowing shorter sequences to be retained rather than discarded. We evaluate EchoLVFM on the CAMUS dataset under challenging single-frame conditioning. Quantitative and qualitative results demonstrate competitive video quality, strong EF adherence, and 57.9% discrimination accuracy by expert clinicians, which is close to chance. These findings indicate that efficient, one-step flow matching can enable practical, controllable echocardiogram video synthesis without sacrificing fidelity. Code available at: this https URL

</p>
421. 【2603.13666】Unsupervised Adaptation from FDG to PSMA PET/CT for 3D Lesion Detection under Label Shift

Link: https://arxiv.org/abs/2603.13666

Authors: Xiaofeng Liu, Menghua Xia, Yanis Chemli, Georges El Fakhri, Chi Liu, Jinsong Ouyang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: unlabeled PSMA PET, labeled FDG PET, FDG PET, PSMA PET, PET

Comments: IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Click to view abstract

Abstract:In this work, we propose an unsupervised domain adaptation (UDA) framework for 3D volumetric lesion detection that adapts a detector trained on labeled FDG PET/CT to unlabeled PSMA PET/CT. Beyond covariate shift, cross-tracer adaptation also exhibits label shift in both lesion size composition and the number of lesions per subject. We introduce self-training with two mechanisms that explicitly model and compensate for this label shift. First, we adaptively adjust the detection anchor shapes by re-estimating target domain box scales from selected pseudo labels and updating anchors with an exponential moving average. This increases positive anchor coverage for small PSMA lesions and stabilizes box regression. Second, instead of a fixed confidence threshold for pseudo-label selection, we allocate size bin-wise quotas according to the estimated target domain histogram over lesion volumes. The self-training alternates between supervised learning with prior-guided pseudo labeling on PSMA and supervised learning on labeled FDG. On AutoPET 2024, adapting from 501 labeled FDG studies to 369 $^{18}$F-PSMA studies, the proposed method improves both AP and FROC over the source-only baseline and conventional self-training without label-shift mitigation, indicating that modeling target lesion prevalence and size composition is an effective path to robust cross-tracer detection.
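
The two label-shift mechanisms in the abstract (exponential-moving-average anchor updates and size bin-wise pseudo-label quotas) can be sketched roughly as below. The momentum value, the bin edges, the rounding of quotas and both function names are assumptions made for illustration; the abstract does not specify them:

```python
def ema_update(anchor, target_scale, momentum=0.9):
    """Move each anchor dimension toward the re-estimated target-domain box scale."""
    return tuple(momentum * a + (1.0 - momentum) * t
                 for a, t in zip(anchor, target_scale))


def select_by_quota(candidates, bin_edges, bin_fractions, budget):
    """Keep the most confident pseudo labels in each size bin, up to a per-bin quota.

    candidates: list of (volume, confidence) pairs.
    bin_edges: ascending boundaries between bins (one fewer than the number of bins).
    bin_fractions: estimated target-domain share of lesions in each bin.
    budget: total number of pseudo labels to keep across all bins.
    """
    bins = [[] for _ in bin_fractions]
    for volume, conf in candidates:
        idx = sum(volume > edge for edge in bin_edges)  # index of the size bin
        bins[idx].append((volume, conf))
    selected = []
    for frac, members in zip(bin_fractions, bins):
        quota = round(frac * budget)  # assumed rounding rule
        members.sort(key=lambda vc: vc[1], reverse=True)
        selected.extend(members[:quota])
    return selected
```

Compared with a single confidence threshold, the quota keeps small-lesion pseudo labels in proportion to their estimated prevalence even when their confidences are systematically lower.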

422. 【2603.13466】Open World MRI Reconstruction with Bias-Calibrated Adaptation

Link: https://arxiv.org/abs/2603.13466

Authors: Jiyao Liu, Shangqi Gao, Lihao Liu, Junzhi Ning, Jinjie Wei, Junjun He, Xiahai Zhuang, Ningsheng Xu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world MRI reconstruction, Real-world MRI, unseen imaging centers, MRI reconstruction systems, causing severe performance

Comments

Click to view abstract

Abstract:Real-world MRI reconstruction systems face the open-world challenge: test data from unseen imaging centers, anatomical structures, or acquisition protocols can differ drastically from training data, causing severe performance degradation. Existing methods struggle with this challenge. To address this, we propose BiasRecon, a bias-calibrated adaptation framework grounded in the minimal intervention principle: preserve what transfers, calibrate what does not. Concretely, BiasRecon formulates open-world adaptation as an alternating optimization framework that jointly optimizes three components: (1) frequency-guided prior calibration that introduces layer-wise calibration variables to selectively modulate frequency-specific features of the pre-trained score network via self-supervised k-space signals, (2) score-based denoising that leverages the calibrated generative prior for high-fidelity image reconstruction, and (3) adaptive regularization that employs Stein's Unbiased Risk Estimator to dynamically balance the prior-measurement trade-off, matching test-time noise characteristics without requiring ground truth. By intervening minimally and precisely through this alternating scheme, BiasRecon achieves robust adaptation with fewer than 100 tunable parameters. Extensive experiments across four datasets demonstrate state-of-the-art performance on open-world reconstruction tasks.

423. 【2603.13447】MGMAR: Metal-Guided Metal Artifact Reduction for X-ray Computed Tomography

Link: https://arxiv.org/abs/2603.13447

Authors: Hyoung Suk Park, Kiwan Jeon

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: X-ray computed tomography, degrade diagnostic quality, metallic implants violate, implants violate standard, X-ray computed

Comments: 27 pages, 8 figures, 3 tables

Click to view abstract

Abstract:In X-ray computed tomography (CT), metal artifact reduction (MAR) remains a major challenge because metallic implants violate standard CT forward-model assumptions, producing severe streaking and shadowing artifacts that degrade diagnostic quality. We propose MGMAR, a metal-guided MAR method that explicitly leverages metal-related information throughout the reconstruction pipeline. MGMAR first generates a high-quality prior image by training a conditioned implicit neural representation (INR) using metal-unaffected projections, and then incorporates this prior into a normalized MAR (NMAR) framework for projection completion. To improve robustness under severe metal corruption, we pretrain the encoder-conditioned INR on paired metal-corrupted and artifact-free CT images, thereby embedding data-driven prior knowledge into the INR parameter space. This prior-embedded initialization reduces sensitivity to random initialization and accelerates convergence during measurement-specific refinement. The encoder takes a metal-corrupted reconstruction together with a recursively constructed metal artifact image, enabling the latent field to capture metal-dependent global artifact patterns. After projection completion using the INR prior, we further suppress residual artifacts using a metal-conditioned correction network, where the metal mask modulates intermediate features via adaptive instance normalization to target metal-dependent secondary artifacts while preserving anatomical structures. Experiments on the public AAPM-MAR benchmark demonstrate that MGMAR achieves state-of-the-art performance, attaining an average final score of 0.89 on 29 clinical test cases.

424. 【2603.13439】Bayesian Uncertainty-Aware MRI Reconstruction

Link: https://arxiv.org/abs/2603.13439

Authors: Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: under-sampled k-space measurements, joint magnetic resonance, magnetic resonance image, resonance image reconstruction, k-space measurements

Comments

Click to view abstract

Abstract:We propose a novel framework for joint magnetic resonance image reconstruction and uncertainty quantification using under-sampled k-space measurements. The problem is formulated as a Bayesian linear inverse problem, where prior distributions are assigned to the unknown model parameters. Specifically, we assume the target image is sparse in its spatial gradient and impose a total variation prior model. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then used to sample from the resulting joint posterior distribution of the unknown parameters. Experiments conducted using single- and multi-coil datasets demonstrate the superior performance of the proposed framework over optimisation-based compressed sensing algorithms. Additionally, our framework effectively quantifies uncertainty, showing strong correlation with error maps computed from reconstructed and ground-truth images.
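
For orientation, a Bayesian linear inverse problem with a total variation prior is often written as below. This is a generic textbook form rather than the paper's exact model: the Gaussian noise assumption, the symbols $A$ (under-sampled forward operator), $\sigma^2$ (noise variance) and $\lambda$ (prior weight), and the normalization constants are all assumptions, and the paper may additionally place priors on $\sigma^2$ and $\lambda$.

```latex
\begin{aligned}
y &= A x + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I), \\
p(x \mid \lambda) &\propto \exp\bigl(-\lambda \, \mathrm{TV}(x)\bigr), \\
p(x \mid y, \sigma^2, \lambda) &\propto
  \exp\Bigl(-\tfrac{1}{2\sigma^2}\,\lVert y - A x \rVert_2^2
            - \lambda \, \mathrm{TV}(x)\Bigr).
\end{aligned}
```

Sampling from the posterior, instead of only maximizing it, is what makes per-pixel uncertainty available: the spread of the samples at each pixel can be compared with the reconstruction error map, as the abstract reports.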

425. 【2603.13422】Projection Guided Personalized Federated Learning for Low Dose CT Denoising

Link: https://arxiv.org/abs/2603.13422

Authors: Anas Zafar, Muhammad Waqas, Amgad Muneer, Rukhmini Bandyopadhyay, Jia Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: reduces radiation exposure, reduces radiation, vary across institutions, Guided Personalized Federated, federated learning

Comments: 17 pages, 3 Figures, 6 Tables

Click to view abstract

Abstract:Low-dose CT (LDCT) reduces radiation exposure but introduces protocol-dependent noise and artifacts that vary across institutions. While federated learning enables collaborative training without centralizing patient data, existing methods personalize in image space, making it difficult to separate scanner noise from patient anatomy. We propose ProFed (Projection Guided Personalized Federated Learning), a framework that complements the image space approach by performing dual-level personalization in the projection space, where noise originates during CT measurements before reconstruction combines protocol and anatomy effects. ProFed introduces: (i) anatomy-aware and protocol-aware networks that personalize CT reconstruction to patient and scanner-specific features, (ii) multi-constraint projection losses that enforce consistency with CT measurements, and (iii) uncertainty-guided selective aggregation that weights clients by prediction confidence. Extensive experiments on the Mayo Clinic 2016 dataset demonstrate that ProFed achieves 42.56 dB PSNR with CNN backbones and 44.83 dB with Transformers, outperforming 11 federated learning baselines, including the physics-informed SCAN-PhysFed by +1.42 dB.
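
Component (iii), weighting clients by prediction confidence during aggregation, might look like the sketch below. Inverse-uncertainty weighting is an assumed choice made here for illustration; the abstract does not say how confidence is turned into weights or which parameters are aggregated, and the function name is hypothetical:

```python
def aggregate(client_params, client_uncertainties, eps=1e-8):
    """Average client parameter vectors, weighting each client by inverse uncertainty."""
    confidences = [1.0 / (u + eps) for u in client_uncertainties]
    total = sum(confidences)
    coeffs = [c / total for c in confidences]  # normalized weights, sum to 1
    dim = len(client_params[0])
    return [sum(k * p[i] for k, p in zip(coeffs, client_params))
            for i in range(dim)]
```

With equal uncertainties this reduces to a plain average of the client parameters; a client reporting high uncertainty contributes proportionally less to the aggregated model.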