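This blog post presents the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

491 papers were updated today, including:

  • Natural Language Processing: 59
  • Information Retrieval: 16
  • Computer Vision: 146

Natural Language Processing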

1. 【2603.13201】Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Link: https://arxiv.org/abs/2603.13201

Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong

Categories: Computation and Language (cs.CL)

Keywords: Instruction Tuning, large language models, neuron activation, effective approach, approach to unlock

Comments: none

Abstract:Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most efficient data subset from the IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
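
The selection step in this abstract (rank candidate samples by how similar their activation features are to a target feature, then keep a small fraction) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each sample has already been reduced to a fixed-length activation vector, cosine similarity is an assumed choice of similarity measure, and `cosine`, `select_top_fraction`, and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_top_fraction(candidates, target, fraction=0.1):
    """Return indices of the candidates most similar to the target feature.

    candidates: list of activation vectors, one per IT sample (hypothetical input)
    target: activation feature of the target capability (hypothetical input)
    """
    k = max(1, int(len(candidates) * fraction))
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(candidates[i], target),
        reverse=True,
    )
    return ranked[:k]

# Toy data: four candidate samples, keep the top 50%.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
target = [1.0, 0.0]
selected = select_top_fraction(candidates, target, fraction=0.5)
```

With `fraction=0.1`, the same function would mirror the 10% subset size mentioned in the abstract.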

2. 【2603.13173】Semantic Invariance in Agentic AI

Link: https://arxiv.org/abs/2603.13173

Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, multi-agent coordination systems, increasingly serve, decision support

Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

Abstract:Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
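
A metamorphic check of this kind compares the answer to the original prompt with the answers to its semantically equivalent variants. The sketch below computes a simple invariance rate under an assumed matching rule (case-insensitive, whitespace-normalized string equality); the paper's actual scoring may differ, and `normalize`, `invariance_rate`, and the sample answers are hypothetical.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())

def invariance_rate(baseline, variants):
    """Fraction of variant answers that match the baseline answer after normalization."""
    if not variants:
        return 0.0
    matches = sum(normalize(v) == normalize(baseline) for v in variants)
    return matches / len(variants)

# Toy data: one baseline answer and four answers to transformed prompts.
rate = invariance_rate("42 m/s", ["42 M/S", "42  m/s", "40 m/s", "42 m/s"])
```

Exact matching is a deliberately strict stand-in; a semantic similarity score could replace `normalize` for free-form answers.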

3. 【2603.13168】Developing and evaluating a chatbot to support maternal health care

Link: https://arxiv.org/abs/2603.13168

Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: low health literacy, significant impact, access to care, ability to provide, provide trustworthy maternal

Comments: 17 pages; submitted to IJCAI 2026 AI and Social Good Track

Abstract:The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
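
The trade-off between missed emergencies and over-escalation mentioned above can be made concrete with two ratios. This is a hedged sketch that assumes binary labels (True means emergency); `triage_metrics` and the toy labels are hypothetical and do not reproduce the paper's benchmark.

```python
def triage_metrics(gold, predicted):
    """Return (emergency recall, over-escalation rate) for boolean label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # emergencies correctly escalated
    fn = sum(1 for g, p in pairs if g and not p)      # missed emergencies
    fp = sum(1 for g, p in pairs if not g and p)      # routine queries escalated
    tn = sum(1 for g, p in pairs if not g and not p)  # routine queries handled normally
    recall = tp / (tp + fn) if tp + fn else 0.0
    over_escalation = fp / (fp + tn) if fp + tn else 0.0
    return recall, over_escalation

# Toy data: three true emergencies, five routine queries.
gold = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
recall, over_escalation = triage_metrics(gold, predicted)
```

Lowering the escalation threshold typically raises recall at the cost of a higher over-escalation rate, which is why both numbers are worth reporting together.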

4. 【2603.13154】ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Link: https://arxiv.org/abs/2603.13154

Authors: Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: increasingly incorporates environmental, corporate responsibility increasingly, responsibility increasingly incorporates, documenting sustainability practices, assessing firms' long-term

Comments: To be published in the AAAI 2026 proceedings

Abstract:As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and to analyze automatically and reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

5. 【2603.13045】Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Link: https://arxiv.org/abs/2603.13045

Authors: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li

Categories: Computation and Language (cs.CL)

Keywords: demonstrated remarkable capability, high-resource language pairs, demonstrated remarkable, remarkable capability, capability in machine

Comments: Our code is available at [this https URL](https://github.com/LeiLiLab/WALAR)

Abstract:Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.

6. 【2603.13038】Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Link: https://arxiv.org/abs/2603.13038

Authors: Hubert Plisiecki, Maria Leniarska, Jan Piotrowski, Marcin Zajenkowski

Categories: Computation and Language (cs.CL)

Keywords: Supervised Semantic Differential, continuous individual-difference variables, Supervised Semantic, Semantic Differential, mixed quantitative-interpretive method

Comments: Submitted to ACL 2026

Abstract:Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual using a high-PCA dimension solution heuristic produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD's interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.

7. 【2603.13023】daVinci-Env: Open SWE Environment Synthesis at Scale

Link: https://arxiv.org/abs/2603.13023

Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: capable software engineering, provide dynamic feedback, dynamic feedback loops, agents demands large-scale, Training capable software

Comments: none

Abstract:Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.

8. 【2603.13017】Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Link: https://arxiv.org/abs/2603.13017

Authors: Sydney Lewis

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Long conversations, user conversation history, create a simple, simple problem, agent create

Comments: 6 figures. Code: [this https URL](https://github.com/Process-Point-Technologies-Corporation/searchat)

Abstract:Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
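
The retrieval comparison above is reported as mean reciprocal rank (MRR). As a reminder of how that metric is computed, here is a small sketch using the standard definition; `mean_reciprocal_rank` and the toy rankings are hypothetical, and only the two MRR values in the last line come from the abstract.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant item per query (0 if none is found).

    ranked_results: one ranked list of result ids per query
    relevant: one set of relevant ids per query
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Toy data: first query hits at rank 1, second at rank 2, third misses.
mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["d", "e", "f"], ["g", "h"]],
    [{"a"}, {"e"}, {"z"}],
)

# Ratio of the best distilled MRR to the best verbatim MRR quoted in the abstract.
retained = 0.717 / 0.745
```

Rounded to two decimals, `retained` matches the 96% figure stated in the abstract.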

9. 【2603.12983】Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Link: https://arxiv.org/abs/2603.12983

Authors: Boxuan Lyu, Haiyue Song, Zhi Qu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Error Span Detection, Machine Translation, translation errors, subtask in Machine, Span Detection

Comments: none

Abstract:Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
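
MBR decoding picks, from a set of sampled candidates, the one with the highest average utility against the other candidates. The sketch below shows that selection rule only. The token-overlap F1 utility is a stand-in chosen for illustration, not the utility used in the paper, and `overlap_f1`, `mbr_select`, and the toy annotations are hypothetical.

```python
from collections import Counter

def overlap_f1(hyp, ref):
    """Token-level F1 between two strings (an assumed, simplistic utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=overlap_f1):
    """Return the index of the candidate with the highest mean utility vs. the rest."""
    if len(candidates) < 2:
        return 0
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)
    return max(range(len(candidates)), key=expected_utility)

# Toy data: three sampled error annotations for the same translation.
candidates = [
    "minor error at word 3",
    "minor error at word 3",
    "major error at word 7",
]
best = mbr_select(candidates)
```

The selected candidate can then serve as a pseudo-label for the next training round, which is the general idea behind distilling MBR outputs.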

10. 【2603.12963】Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Link: https://arxiv.org/abs/2603.12963

Authors: Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

Categories: Computation and Language (cs.CL)

Keywords: reinforcement learning-based alignment, learning-based alignment highlights, reward, widespread adoption, adoption of reinforcement

Comments: Accepted by AAAI 2026

Abstract:The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.

11. 【2603.12932】DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Link: https://arxiv.org/abs/2603.12932

Authors: Ruiyao Xu, Noelle I. Samia, Han Liu

Categories: Computation and Language (cs.CL)

Keywords: Adapting Large Language, Large Language Models, Adapting Large, Large Language, requires high-quality instruction

Comments: none

Abstract:Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
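
Pairing keywords with cognitive levels amounts to taking a Cartesian product and turning each pair into a generation prompt. The sketch below assumes the six levels of the revised Bloom's Taxonomy; the prompt template, `build_prompts`, and the example keywords are hypothetical and are not taken from the paper.

```python
from itertools import product

# Levels of the revised Bloom's Taxonomy, from lowest to highest.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

def build_prompts(keywords, levels=BLOOM_LEVELS):
    """Create one instruction-generation prompt per (keyword, level) pair."""
    return [
        f"Write a task at the '{level}' level about: {keyword}"
        for keyword, level in product(keywords, levels)
    ]

# Toy data: two finance keywords yield 2 x 6 = 12 prompts.
prompts = build_prompts(["compound interest", "bond duration"])
```

Each prompt would then be sent to an LLM, and the generated instructions filtered, for example by the self-consistency validation the abstract mentions.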

12. 【2603.12920】HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Link: https://arxiv.org/abs/2603.12920

Authors: Zixin Feng, Xinying Cui, Yifan Sun, Zheng Wei, Jiachen Yuan, Jiazhen Hu, Ning Xin, Md Maruf Hasan

Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: multiple categories, social media, media is inherently, abusive behaviors, behaviors often overlap

Comments: none

Abstract:Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
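
Confidence-based pseudo-labeling keeps only the unlabeled examples on which the current model is sufficiently sure, then adds them to the training set for the next round. A minimal sketch of the filtering step follows; the threshold value, `select_pseudo_labels`, and the toy predictions are assumptions for illustration, not details from the paper.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep (text, label) pairs whose top predicted probability meets the threshold.

    predictions: list of (text, {label: probability}) pairs from the current model
    """
    selected = []
    for text, probs in predictions:
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append((text, label))
    return selected

# Toy data: the middle example is too uncertain to be pseudo-labeled.
predictions = [
    ("text a", {"bullying": 0.97, "neutral": 0.03}),
    ("text b", {"bullying": 0.55, "neutral": 0.45}),
    ("text c", {"bullying": 0.05, "neutral": 0.95}),
]
pseudo = select_pseudo_labels(predictions, threshold=0.9)
```

In an iterative setup, the model would be retrained on the labeled data plus `pseudo`, and the filter applied again to the remaining unlabeled pool.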

13. 【2603.12906】Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

Link: https://arxiv.org/abs/2603.12906

Authors: Liel Binyamin, Elior Sulem

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: leaving open questions, developmentally plausible language, Research on developmentally, plausible language models, leaving open

Comments: Accepted to Findings of EACL 2026

Abstract:Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.

14. 【2603.12872】CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

Link: https://arxiv.org/abs/2603.12872

Authors: João Silva, Luís Gomes, António Branco

Categories: Computation and Language (cs.CL)

Keywords: Open Large Language, Large Language Models, Open Large, European Portuguese, Large Language

Comments: Accepted at PROPOR 2026

Abstract:This paper reports on the development of a leaderboard of Open Large Language Models (LLM) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLM for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that so far have not been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at this https URL.

15. 【2603.12826】Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

Link: https://arxiv.org/abs/2603.12826

Authors: Xu Guo, Qiming Ge, Jian Tong, Kedi Chen, Jin Zhang, Xiaogui Yang, Xuan Gao, Haijun Lv, Zhihui Lu, Yicheng Zou, Qipeng Guo

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Reinforcement Learning, Large Language, capabilities of Large, Language Models

Comments: none

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.

16. 【2603.12823】Adaptive Vision-Language Model Routing for Computer Use Agents

Link: https://arxiv.org/abs/2603.12823

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, User Interface, Graphical User, translate natural-language instructions, instructions into Graphical

Comments: none

Abstract:Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: this https URL.
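
The threshold-based policy described here (send each action to the cheapest model whose predicted accuracy clears a reliability target) can be sketched in a few lines. This is an illustration under stated assumptions, not the AVR implementation: the per-model accuracy estimates are taken as given, the fallback to the most accurate model is an assumed design choice, and `route_action`, the dictionary keys, and the model names are hypothetical.

```python
def route_action(models, threshold):
    """Pick the cheapest model whose predicted accuracy meets the threshold.

    models: list of dicts with hypothetical keys "name", "cost", "predicted_accuracy"
    Falls back to the most accurate model when none meets the threshold.
    """
    eligible = [m for m in models if m["predicted_accuracy"] >= threshold]
    if eligible:
        return min(eligible, key=lambda m: m["cost"])
    return max(models, key=lambda m: m["predicted_accuracy"])

# Toy pool: costs and accuracy estimates are made-up numbers.
pool = [
    {"name": "small-vlm", "cost": 1.0, "predicted_accuracy": 0.82},
    {"name": "medium-vlm", "cost": 4.0, "predicted_accuracy": 0.91},
    {"name": "large-vlm", "cost": 20.0, "predicted_accuracy": 0.97},
]
easy = route_action(pool, threshold=0.80)["name"]
hard = route_action(pool, threshold=0.95)["name"]
```

Raising the threshold shifts more actions to expensive models, so the threshold directly controls the cost-accuracy trade-off the abstract formalizes.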

17. 【2603.12795】SteerRM: Debiasing Reward Models via Sparse Autoencoders

Link: https://arxiv.org/abs/2603.12795

Authors: Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye

Categories: Computation and Language (cs.CL)

Keywords: superficial stylistic cues, preferring better-presented responses, preferring better-presented, critical components, semantically superior

Comments: none

Abstract:Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.

18. 【2603.12768】SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

Link: https://arxiv.org/abs/2603.12768

Authors: Aditya Maheshwari, Amit Gajkeshwar, Kaushal Sharma, Vivek Patel

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, groups fairly, popular source, treats different groups

Comments: 14 pages; 3 figures

Abstract: As Large Language Models (LLMs) become a popular source of religious knowledge, it is important to know whether they treat different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi and consisting of 88 questions, to check the bias of 15 top LLMs, both proprietary and open-weights. Our results show a major inconsistency based on language. In English, many powerful models, such as DeepSeek-v3 and GPT-4o, often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models such as Claude-3.5 changed their answers to match the user's country, giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user's location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious "truth" changes depending on the language you speak and the country you claim to be from. The data set is available at this https URL

19. 【2603.12754】A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

Link: https://arxiv.org/abs/2603.12754

Authors: Paul Van Eecke, Katrien Beuls

Categories: Computation and Language (cs.CL)

Keywords: Fluid Construction Grammar, broad-coverage construction grammars, computational construction grammars, construction grammars, Construction Grammar

Abstract:We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

20. 【2603.12743】MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

Link: https://arxiv.org/abs/2603.12743

Authors: Chenyang Zhu, Hongxiang Li, Xiu Li, Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Knowledge-aware Concept Customization, Concept, target concept, Concept customization, Concept customization typically

Comments: Project Page: [this https URL](https://chenyangzhu1.github.io/MoKus/)

Abstract: Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.

21. 【2603.12710】AI Planning Framework for LLM-Based Web Agents

Link: https://arxiv.org/abs/2603.12710

Authors: Orit Shahnovsky, Rotem Dror

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Developing autonomous agents, Large Language Model, Developing autonomous, Tree Search, core challenge

Abstract:Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.

22. 【2603.12702】FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning

Link: https://arxiv.org/abs/2603.12702

Authors: Chaojie Sun, Bin Cao, Tiantian Li, Chenyu Hou, Ruizhe Li, Qing Fan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, language models, growing efforts, rapid advancement, made on LLM-based

Comments: Under Review - Submitted to SIGIR 2026 Resources Track; 10 pages, 5 figures, 4 tables

Abstract: With the rapid advancement of large language models (LLMs), growing efforts have been made on LLM-based table retrieval. However, existing studies typically focus on single-table queries and implement them by similarity matching after encoding the entire table. These methods usually yield low accuracy because their coarse-grained encoding incorporates much query-irrelevant data; they are also inefficient on large tables and fail to fully utilize the reasoning capabilities of LLMs. Further, multi-table queries are under-explored in retrieval tasks. To this end, we propose a hierarchical multi-table query method based on LLMs: Fine-Grained Multi-Table Retrieval (FGTR), a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD. Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F_2 metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.
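
For readers unfamiliar with the F_2 metric mentioned above, here is a minimal sketch of the standard F-beta score with beta = 2, which weights recall more heavily than precision. That the paper uses exactly this definition is an assumption, and the precision and recall values below are made up for illustration rather than taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing relevant is retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical retrievers (illustrative numbers only).
high_recall = f_beta(precision=0.5, recall=0.9)      # misses few relevant cells
high_precision = f_beta(precision=0.9, recall=0.5)   # returns few irrelevant cells

print(f"F2, recall-heavy retriever:    {high_recall:.3f}")
print(f"F2, precision-heavy retriever: {high_precision:.3f}")
```

With beta = 2, the recall-heavy retriever scores higher. That is a plausible reason to prefer F_2 for sub-table retrieval, where dropping a needed cell is costlier than keeping an extra one.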

23. 【2603.12698】EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

Link: https://arxiv.org/abs/2603.12698

Authors: Chi Ruan, Dongfu Jiang, Huaye Zeng, Ping Nie, Wenhu Chen

Categories: Computation and Language (cs.CL)

Keywords: large language models, static verification signals, verifiable rewards, language models, promising approach

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
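
As background for the pass@1 figures above, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled solutions and c the number that pass all tests. Whether the paper computes pass@1 this way is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 sampled solutions.
# Stricter test cases reject solutions the original tests accepted.
before = pass_at_k(n=10, c=5, k=1)  # 5 of 10 pass the original tests
after = pass_at_k(n=10, c=3, k=1)   # only 3 of 10 pass the evolved tests

print(f"pass@1 before: {before:.2f}, after: {after:.2f}")
```

For k = 1 the estimator reduces to c / n, so a drop in pass@1 under a fixed set of candidate solutions directly reflects test cases that reject more of them.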

24. 【2603.12683】Experimental evidence of progressive ChatGPT models self-convergence

Link: https://arxiv.org/abs/2603.12683

Authors: Konstantinos F. Xylogiannopoulos, Petros Xanthopoulos, Panagiotis Karampelas, Georgios A. Bakamitsos

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, undergo recursive training, Language Models, undergo recursive

Abstract:Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from either theoretical or empirical perspectives, often focusing on a single model trained recursively on its own outputs. While prior studies have cautioned against the potential degradation of LLM output quality under such conditions, no longitudinal investigation has yet been conducted to assess this effect over time. In this study, we employ a text similarity metric to evaluate different ChatGPT models' capacity to generate diverse textual outputs. Our findings indicate a measurable decline of recent ChatGPT releases' ability to produce varied text, even when explicitly prompted to do so, by setting the temperature parameter to one. The observed reduction in output diversity may be attributed to the influence of the amounts of synthetic data incorporated within their training datasets as the result of internet infiltration by LLM generated data. The phenomenon is defined as model self-convergence because of the gradual increase of similarities of produced texts among different ChatGPT versions.

25. 【2603.12677】MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

Link: https://arxiv.org/abs/2603.12677

Authors: Shuxin Liu, Ou Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, disrupting general capabilities, precisely rectify specific, rectify specific knowledge

Comments: 17 pages, 2 figures

Abstract:Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs) without disrupting general capabilities. State-of-the-art methods suffer from an open-loop control mismatch. We identify a critical "Semantic-Execution Disconnect": the semantic target is derived independently without feedback from the downstream's feasible region. This misalignment often causes valid semantic targets to fall within the prohibited space, resulting in gradient truncation and editing failure. To bridge this gap, we propose MetaKE (Meta-learning Aligned Knowledge Editing), a new framework that reframes KE as a bi-level optimization problem. Departing from static calculation, MetaKE treats the edit target as a learnable meta-parameter: the upper-level optimizer seeks a feasible target to maximize post-edit performance, while the lower-level solver executes the editing. To address the challenge of differentiating through complex solvers, we derive a Structural Gradient Proxy, which explicitly backpropagates editability constraints to the target learning phase. Theoretical analysis demonstrates that MetaKE automatically aligns the edit direction with the model's feasible manifold. Extensive experiments confirm that MetaKE significantly outperforms strong baselines, offering a new perspective on knowledge editing.

26. 【2603.12664】From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

Link: https://arxiv.org/abs/2603.12664

Authors: Lehui Li, Yuyao Wang, Jisheng Yan, Wei Zhang, Jinliang Deng, Haoliang Sun, Zhongyi Han, Yongshun Gong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Incorporating textual information, addressing event-driven non-stationarity, hinders effective fusion, textual descriptions express, modality gap hinders

Comments: 15 pages, 6 figures

Abstract:Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.

27. 【2603.12658】Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

Link: https://arxiv.org/abs/2603.12658

Authors: Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: mitigating catastrophic forgetting-a, static pre-training paradigm, pre-training paradigm inherent, catastrophic forgetting-a critical, forgetting-a critical limitation

Abstract: Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual this http URL the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

28. 【2603.12646】98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Link: https://arxiv.org/abs/2603.12646

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Categories: Computation and Language (cs.CL)

Keywords: intercept LLM requests, PII detection, System-level routers, add minimal latency, Stage

Abstract: System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU, an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's O(n^2) memory makes long-context classification (8K-32K tokens) impossible: at 8K tokens, three concurrent classifiers need ~4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from O(n^2) to O(n) and end-to-end (E2E) latency from 4,918 ms to 127 ms (38.7×), enabling 8K-32K tokens where SDPA OOMs. Stage 2: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ~512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127 → 62 ms, 2.0×). Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62 → 50 ms, 1.2×). Cumulatively: 98× improvement (4,918 ms to 50 ms), 16K-token routing in 108 ms, and a total router GPU footprint under 800 MB, small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage 1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages 2 and 3 are hardware-agnostic.
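
The per-stage and cumulative speedups quoted above can be re-derived from the four end-to-end latencies alone. The short check below uses only the numbers stated in the abstract; the stage labels are shorthand added here.

```python
# End-to-end latencies in milliseconds, as reported in the abstract.
latencies_ms = [
    ("baseline", 4918.0),
    ("stage 1: flash attention", 127.0),
    ("stage 2: prompt compression", 62.0),
    ("stage 3: near-streaming", 50.0),
]

# Speedup of each stage relative to the previous configuration.
per_stage = [
    prev / cur
    for (_, prev), (_, cur) in zip(latencies_ms, latencies_ms[1:])
]

# Cumulative speedup is the ratio of the first latency to the last.
cumulative = latencies_ms[0][1] / latencies_ms[-1][1]

for (name, _), speedup in zip(latencies_ms[1:], per_stage):
    print(f"{name}: {speedup:.1f}x")
print(f"cumulative: {cumulative:.0f}x")
```

Multiplying the rounded factors (38.7 × 2.0 × 1.2) gives about 93×, not 98×. The headline figure comes from the unrounded ratio 4,918 / 50.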

29. 【2603.12638】Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System

Link: https://arxiv.org/abs/2603.12638

Authors: Necva Bölücü, Jessica Irons, Changhyun Lee, Brian Jin, Maciej Rybinski, Huichen Yang, Andreas Duenser, Stephen Wan

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: knowledge increasingly impractical, structured knowledge increasingly, increasingly impractical, made manual extraction, rapid growth

Comments: 17 pages, 9 figures, EACL demo track

Abstract:The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.

30. 【2603.12615】Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior

Link: https://arxiv.org/abs/2603.12615

Authors: David C. Flynn

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: correct-sounding ethical responses, evaluation frameworks test, moral reasoning capacity, frameworks test, production of correct-sounding

Comments: 27 pages, 6 tables. Target: Minds and Machines (Springer)

Abstract:Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.

31. 【2603.12582】RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

Link: https://arxiv.org/abs/2603.12582

Authors: He Zhu, Yanshu Li, Wen Liu, Haitian Yang

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Natural Language Processing, Language Processing, Natural Language, mislead deep learning, threat to Natural

Comments: 15 pages, 4 figures

Abstract: Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator, without fine-tuning, to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism, particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
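
The detection loop described above (localize suspicious tokens, mask them, compare the victim model's confidence before and after) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_rtd_score` and `toy_victim` are hypothetical stand-ins for a real RTD discriminator and victim classifier, and both thresholds are arbitrary assumptions.

```python
from typing import Callable, Dict, List

Probs = Dict[str, float]


def detect_adversarial(
    tokens: List[str],
    rtd_score: Callable[[List[str], int], float],
    victim: Callable[[List[str]], Probs],
    rtd_threshold: float = 0.5,    # assumed cut-off for "looks replaced"
    shift_threshold: float = 0.3,  # assumed cut-off for the confidence drop
    mask_token: str = "[MASK]",
) -> bool:
    """Flag the input if masking suspicious tokens shifts victim confidence."""
    suspicious = [rtd_score(tokens, i) > rtd_threshold for i in range(len(tokens))]
    if not any(suspicious):
        return False  # nothing to mask, so no shift can be observed
    masked = [mask_token if s else t for t, s in zip(tokens, suspicious)]
    before = victim(tokens)   # black-box query 1: original text
    after = victim(masked)    # black-box query 2: masked text
    label = max(before, key=before.get)
    return before[label] - after[label] > shift_threshold


def toy_rtd_score(tokens: List[str], i: int) -> float:
    """Hypothetical discriminator: treats one unusual word as 'replaced'."""
    return 0.9 if tokens[i] == "stellar" else 0.1


def toy_victim(tokens: List[str]) -> Probs:
    """Hypothetical sentiment classifier driven by two keywords."""
    p = 0.5 + 0.4 * sum(t in {"great", "stellar"} for t in tokens)
    p = min(max(p, 0.05), 0.95)
    return {"pos": p, "neg": 1.0 - p}


clean_flag = detect_adversarial(["the", "plot", "was", "great"], toy_rtd_score, toy_victim)
adv_flag = detect_adversarial(["the", "plot", "was", "stellar"], toy_rtd_score, toy_victim)
print(clean_flag, adv_flag)
```

In this sketch the detector only ever sees the victim's output probabilities, which mirrors the black-box, two-query setting the abstract describes.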

32. 【2603.12577】Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Link: https://arxiv.org/abs/2603.12577

Authors: Jia-Chen Zhang, Zhen-Wei Yan, Yu-Jie Xiong, Chun-Ming Xia

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extreme parameter efficiency, Parameter-Efficient Fine-Tuning, multi-task scenarios due, dominant paradigm, paradigm for deploying

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks--where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) A shared meta-knowledge Subspace that encodes universal linguistic patterns in low dimensions; (2) A Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.

33. 【2603.12572】LMEB: Long-horizon Memory Embedding Benchmark

Link: https://arxiv.org/abs/2603.12572

Authors: Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang

Categories: Computation and Language (cs.CL)

Keywords: long-horizon memory retrieval, memory retrieval tasks, temporally distant information, memory retrieval, long-horizon memory

Comments: 35 pages, 9 figures, 23 tables

Abstract:Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at this https URL.

34. 【2603.12565】Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization

Link: https://arxiv.org/abs/2603.12565

Authors: Mengjie Zhao, Lianbo Liu, Yusuke Fujita, Hao Shi, Yuan Gao, Roman Koshkin, Yui Sudo

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: text-based LLM backbones, typically combine ASR-trained, combine ASR-trained encoders, LLM backbones, text-based LLM

Abstract:SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.

35. 【2603.12564】AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Link: https://arxiv.org/abs/2603.12564

Authors: Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Tool-augmented LLM agents, Tool-augmented LLM, agents increasingly serve, LLM agents increasingly, increasingly serve

Comments: 50 pages, 31 tables, 15 figures. Under review at COLM 2026

Abstract:Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
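
To make the "evaluation blindness" pattern above concrete, the sketch below computes standard NDCG for a clean and a contaminated recommendation list. NDCG here is the usual textbook form; treating the utility preservation ratio as contaminated NDCG divided by clean NDCG is an assumption, the paper's sNDCG variant is not reproduced because the abstract does not define it, and all data are made up.

```python
import math
from typing import List, Tuple

# Each recommended item: (graded relevance, is_risk_inappropriate).
Item = Tuple[float, bool]


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(items: List[Item]) -> float:
    """NDCG of a ranked list; the ideal order is taken from the same items."""
    rels = [rel for rel, _ in items]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


# Hypothetical single turn: same relevance profile, but the contaminated
# run places a risk-inappropriate product at the top.
clean = [(3.0, False), (2.0, False), (1.0, False)]
contaminated = [(3.0, True), (2.0, False), (1.0, False)]

preservation_ratio = ndcg(contaminated) / ndcg(clean)  # assumed definition
has_risky_item = any(risky for _, risky in contaminated)

print(f"preservation ratio: {preservation_ratio:.2f}, risky item shown: {has_risky_item}")
```

Because NDCG looks only at relevance grades, the ratio stays at 1.0 even though a risk-inappropriate product is recommended. This is the kind of gap a safety-aware metric is meant to expose.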

36. 【2603.12554】Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

链接https://arxiv.org/abs/2603.12554

作者:Vishnu Teja Kunde,Fatemeh Doudi,Mahdi Farahbakhsh,Dileep Kalathil,Krishna Narayanan,Jean-Francois Chamberland

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Reinforcement learning, intractable sequence-level likelihoods, language models, diffusion language models, extending these methods

备注

点击查看摘要

Abstract:Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at this https URL.
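
For orientation, the textbook finite-horizon policy-gradient identity has the shape below. The notation is hypothetical and not taken from the paper: $t = 1, \dots, T$ indexes denoising steps, $s_t$ is the partially denoised sequence, $a_t$ is the denoising move, and $A_t$ is an advantage at step $t$. The paper's exact estimator and its entropy-guided step selection may differ from this generic form.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t(s_t, a_t)\right]
$$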

37. 【2603.12529】TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

链接https://arxiv.org/abs/2603.12529

作者:Alliot Nagle,Jakhongir Saydaliev,Dhia Garbaya,Michael Gastpar,Ashok Vardhan Makkuva,Hyeji Kim

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Reasoning Models, generate intermediate thinking, intermediate thinking tokens, Large Reasoning, Reasoning Models

备注: 35 pages, 31 figures

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.

38. 【2603.12522】LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

链接https://arxiv.org/abs/2603.12522

作者:Himel Ghosh,Nick Elias Werner

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:large language models, LLM BiasScope, bias, LLM, deployed widely

备注: Accepted at EACL 2026 (24-29 March, Morocco)

点击查看摘要

Abstract:As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on this http URL with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.

39. 【2603.12520】When LLM Judge Scores Look Good but Best-of-N Decisions Fail

链接https://arxiv.org/abs/2603.12520

作者:Eddie Landesberg

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, score candidate responses, single global metric, Large language, candidate responses

备注

点击查看摘要

Abstract:Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
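
The recovery figure quoted above (the share of the oracle-over-random improvement that the judge captures) and the pairwise tie rate can be sketched as follows. This is a minimal illustration under assumptions: ties at the judge's top score are resolved by averaging over the tied candidates, and the function names and toy scores are hypothetical rather than the paper's code.

```python
from itertools import combinations

def recovery(true_scores, judge_scores):
    """Fraction of the oracle-vs-random gain captured by judge-based selection.

    true_scores / judge_scores: one list of candidate scores per prompt.
    """
    judge_val = oracle_val = random_val = 0.0
    for truth, judged in zip(true_scores, judge_scores):
        top = max(judged)
        tied = [i for i, s in enumerate(judged) if s == top]
        # Expected true quality when ties at the top are broken uniformly.
        judge_val += sum(truth[i] for i in tied) / len(tied)
        oracle_val += max(truth)               # perfect selection
        random_val += sum(truth) / len(truth)  # uniform random choice
    return (judge_val - random_val) / (oracle_val - random_val)

def tie_rate(judge_scores):
    """Share of within-prompt candidate pairs that receive equal judge scores."""
    ties = pairs = 0
    for judged in judge_scores:
        for a, b in combinations(judged, 2):
            pairs += 1
            ties += a == b
    return ties / pairs

# Toy best-of-4 data: two prompts, one good candidate each.
truth = [[1, 0, 0, 0], [0, 1, 0, 0]]
judge = [[5, 1, 1, 1], [3, 3, 1, 1]]
print(recovery(truth, judge), tie_rate(judge))
```

Both quantities are computed within each prompt, which is the point of the abstract: a global correlation can look acceptable while these per-prompt numbers stay low.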

40. 【2603.12510】Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

链接https://arxiv.org/abs/2603.12510

作者:Siddharth Srikanth,Freddie Liang,Sophie Hsu,Varun Bhatt,Shihan Zhao,Henry Chen,Bryon Tjanaka,Minjune Hwang,Akanksha Saran,Daniel Seita,Aaquib Tabrez,Stefanos Nikolaidis

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:enable general-purpose robotic, general-purpose robotic systems, Quality Diversity, Q-DIG, significant potential

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at this http URL.

41. 【2603.12471】Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback

链接https://arxiv.org/abs/2603.12471

作者:Mei Tan,Lena Phalen,Dorottya Demszky

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:Effective personalized feedback, students' literacy development, Effective personalized, literacy development, critical to students'

备注: To appear in LAK 2026

点击查看摘要

Abstract:Effective personalized feedback is critical to students' literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how "personalization" shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes--even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias--overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.

42. 【2603.12458】Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

链接https://arxiv.org/abs/2603.12458

作者:Xing Zi,Xinying Zhou,Jinghao Xiao,Catarina Moreira,Mukesh Prasad

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, single-hop factual recall, real-world clinical settings, achieve expert-level performance

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: this https URL

43. 【2603.12453】CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

链接https://arxiv.org/abs/2603.12453

作者:Christos Tzouvaras,Konstantinos Skianis,Athanasios Voulodimos

类目:Computation and Language (cs.CL)

关键词:Clear Reply, Clear Non-Reply, Deliberative Complexity Gating, Clear, paper describes

备注

点击查看摘要

Abstract:This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.
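
The combination of self-consistency sampling and weighted voting across two models can be sketched as below. This is only an illustration under assumptions: each model's vote share is normalised by its number of samples, the model names and weights are hypothetical, and the Deliberative Complexity Gating step is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict

def weighted_vote(samples_by_model, weights):
    """Combine self-consistency samples from several models.

    samples_by_model: model name -> list of sampled labels.
    weights: model name -> voting weight (hypothetical values).
    """
    scores = defaultdict(float)
    for model, samples in samples_by_model.items():
        for label in samples:
            # Each model distributes its weight evenly over its samples.
            scores[label] += weights[model] / len(samples)
    return max(scores, key=scores.get)

samples = {
    "model_a": ["Clear Reply", "Ambivalent", "Ambivalent"],
    "model_b": ["Clear Reply", "Clear Reply", "Clear Reply"],
}
weights = {"model_a": 0.6, "model_b": 0.4}
print(weighted_vote(samples, weights))
```

Here model_a alone would favour Ambivalent, but the unanimous samples from model_b tip the weighted total toward Clear Reply.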

44. 【2603.12423】Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

链接https://arxiv.org/abs/2603.12423

作者:Abdullah Al Mofael,Lisa M. Kuhn,Ghassan Alkadi,Kuo-Pao Yang

类目:Computation and Language (cs.CL)

关键词:modern language models, causing reversed meanings, factual errors, persistent challenge, challenge for modern

备注: 9 pages, 4 figures, 1 table. Accepted at the 2026 IEEE 16th Annual Computing and Communication Workshop and Conference (CCWC)

点击查看摘要

Abstract:Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model's sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific attention heads were temporarily disabled to observe how logical polarity changed. Together, these steps revealed how negation signals move and evolve through GPT-2's layers. Our findings indicate that this capability is not widespread; instead, it is highly concentrated within a limited number of mid-layer attention heads, primarily within layers 4 to 6. Ablating these specific components directly disrupts the model's negation sensitivity: on our in-domain data, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior. On xNot360, ablation slightly decreased NES and rescue restored performance above baseline. This pattern demonstrates that these causal patterns are consistent across various negation forms and remain detectable on the external xNot360 benchmark, though with smaller magnitude.

45. 【2603.12397】Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

链接https://arxiv.org/abs/2603.12397

作者:Pengcheng Wen,Yanxu Zhu,Jiapeng Sun,Han Zhu,Yujin Zhou,Chi-Min Chan,Sirui Han,Yike Guo

类目:Computation and Language (cs.CL)

关键词:recent work suggests, LLM decision-making, window into LLM, post-hoc rationalization, recent work

备注

点击查看摘要

Abstract:Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning's causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with \textit{Evil} reasoning embracing malice, \textit{Misleading} reasoning rationalizing harm, and \textit{Submissive} reasoning yielding to pressure. We train models (0.6B--14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.

46. 【2603.12378】NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation

链接https://arxiv.org/abs/2603.12378

作者:Yuxin Yang,Haoran Zhang,Mingxuan Li,Jiachen Xu,Ruoxi Shen,Zhenyu Wang,Tianhao Liu,Siqi Chen,Weilin Huang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:adapting Large Language, Large Language Models, Large Language, adapting Large, Parameter-Efficient Fine-Tuning

备注: work in progress

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly Low-Rank Adaptation (LoRA), have become essential for adapting Large Language Models (LLMs) to downstream tasks. While the recent FlyLoRA framework successfully leverages bio-inspired sparse random projections to mitigate parameter interference, it relies on a static, magnitude-based routing mechanism that is agnostic to input context. In this paper, we propose NeuroLoRA, a novel Mixture-of-Experts (MoE) based LoRA framework inspired by biological neuromodulation -- the dynamic regulation of neuronal excitability based on context. NeuroLoRA retains the computational efficiency of frozen random projections while introducing a lightweight, learnable neuromodulation gate that contextually rescales the projection space prior to expert selection. We further propose a Contrastive Orthogonality Loss to explicitly enforce separation between expert subspaces, enhancing both task decoupling and continual learning capacity. Extensive experiments on MMLU, GSM8K, and ScienceQA demonstrate that NeuroLoRA consistently outperforms FlyLoRA and other strong baselines across single-task adaptation, multi-task model merging, and sequential continual learning scenarios, while maintaining comparable parameter efficiency.

47. 【2603.12372】Efficient Reasoning with Balanced Thinking

链接https://arxiv.org/abs/2603.12372

作者:Yulin Li,Tengyao Tu,Li Ding,Junjie Wang,Huiling Zhen,Yixin Chen,Yong Li,Zhuotao Tian

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:expending redundant computational, redundant computational steps, remarkable reasoning capabilities, Large Reasoning Models, shown remarkable reasoning

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at this https URL .

48. 【2603.12368】Multi-Step Semantic Reasoning in Generative Retrieval

链接https://arxiv.org/abs/2603.12368

作者:Steven Dong,Yubao Tang,Maarten de Rijke

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Generative retrieval, relevant document identifiers, document identifiers directly, generate relevant document, encode a corpus

备注: Accepted at ECIR2026

点击查看摘要

Abstract:Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.

49. 【2603.12350】TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

链接https://arxiv.org/abs/2603.12350

作者:Liang-Hsuan Tseng,Hung-yi Lee

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:spoken language modeling, intelligent speech-based interactions, Text-speech joint spoken, joint spoken language, language modeling

备注: Work in progress

点击查看摘要

Abstract:Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.

50. 【2603.12343】LLM-Augmented Therapy Normalization and Aspect-Based Sentiment Analysis for Treatment-Resistant Depression on Reddit

链接https://arxiv.org/abs/2603.12343

作者:Yuxin Zhu,Sahithi Lakamana,Masoud Rouhizadeh,Selen Bozkurt,Rachel Hershenberg,Abeed Sarker

类目:Computation and Language (cs.CL)

关键词:major depressive disorder, Treatment-resistant depression, multiple adequate treatment, severe form, form of major

备注

点击查看摘要

Abstract:Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories: positive, neutral, and negative, and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.

51. 【2603.12287】Context-Enriched Natural Language Descriptions of Vessel Trajectories

链接https://arxiv.org/abs/2603.12287

作者:Kostas Patroumpas,Alexandros Troupiotis-Kapeliaris,Giannis Spiliopoulos,Panagiotis Betchavas,Dimitrios Skoutas,Dimitris Zissis,Nikos Bikakis

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)

关键词:transforming raw vessel, raw vessel trajectory, machine reasoning systems, address the problem, problem of transforming

备注

点击查看摘要

Abstract:We address the problem of transforming raw vessel trajectory data collected from AIS into structured and semantically enriched representations interpretable by humans and directly usable by machine reasoning systems. We propose a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips each consisting of clean, mobility-annotated episodes. Each episode is further enriched with multi-source contextual information, such as nearby geographic entities, offshore navigation features, and weather conditions. Crucially, such representations can support generation of controlled natural language descriptions using LLMs. We empirically examine the quality of such descriptions generated using several LLMs over AIS data along with open contextual features. By increasing semantic density and reducing spatiotemporal complexity, this abstraction can facilitate downstream analytics and enable integration with LLMs for higher-level maritime reasoning tasks.

52. 【2603.12277】Prompt Injection as Role Confusion

链接https://arxiv.org/abs/2603.12277

作者:Charles Ye,Jasmine Cui,Dylan Hadfield-Menell

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:Language models remain, extensive safety training, models remain vulnerable, Language models, safety training

备注

点击查看摘要

Abstract:Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

53. 【2603.12275】GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

链接https://arxiv.org/abs/2603.12275

作者:Chahana Dahal,Ashutosh Balasubramaniam,Zuobin Xiong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, digest training data, task in Large

备注

点击查看摘要

Abstract:Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS's superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at this https URL.

54. 【2603.12274】DIALECTIC: A Multi-Agent System for Startup Evaluation

Link: https://arxiv.org/abs/2603.12274

Authors: Jae Yoon Bae, Simon Malberg, Joyce Galang, Andre Retterath, Georg Groh

Categories: Multiagent Systems (cs.MA); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

Keywords: Venture capital, ending up successful, face a large, fewer ending, large number

Comments: Accepted at EACL 2026 Industry Track

Click to view abstract

Abstract:Venture capital (VC) investors face a large number of investment opportunities but invest in only a few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

55. 【2603.12273】Aligning Language Models from User Interactions

Link: https://arxiv.org/abs/2603.12273

Authors: Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, Andreas Krause

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: abundant data produced, Multi-turn user interactions, lack effective methods, Multi-turn user, abundant data

Comments:

Click to view abstract

Abstract:Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user's preferences. Importantly, language models are already able to make use of this information in context. After observing a user's follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. By conditioning the model on the user's follow-up message and comparing the resulting token distribution with the original policy, we obtain a target for updating the policy that captures how the model's behavior changes in hindsight. We then distill this hindsight distribution back into the current policy. Remarkably, we show that training on real-world user conversations from WildChat improves language models across standard alignment and instruction-following benchmarks, without regressing other capabilities. The same mechanism enables personalization, allowing models to continually adapt to individual users through interaction without explicit feedback. Our results demonstrate that raw user interactions that arise naturally during deployment enable alignment, personalization, and continual adaptation.

56. 【2603.12272】ActTail: Global Activation Sparsity in Large Language Models

Link: https://arxiv.org/abs/2603.12272

Authors: Wenwen Hou, Xinyuan Song, Shiwei Liu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: accelerating large language, Activation sparsity, large language model, inference by reducing, memory movement

Comments:

Click to view abstract

Abstract:Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.
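
The TopK magnitude-based step mentioned in this abstract can be illustrated with a minimal Python sketch. It only shows the generic idea of zeroing the smallest-magnitude activations for a given sparsity ratio; the function name `topk_sparsify` and the per-projection budgets are hypothetical, and the HT-SR-based allocation that the paper uses to choose those budgets is not reproduced here.

```python
def topk_sparsify(activations, sparsity):
    """Zero out the smallest-magnitude entries of one activation vector.

    `sparsity` is the fraction of entries set to zero (0.8 means 80% zeroed).
    This is a generic TopK-magnitude sketch, not the paper's implementation.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n = len(activations)
    num_zeroed = int(sparsity * n)   # number of entries to drop
    num_kept = n - num_zeroed        # the K in "TopK"
    # Indices ordered from largest to smallest magnitude.
    order = sorted(range(n), key=lambda i: abs(activations[i]), reverse=True)
    kept = set(order[:num_kept])
    return [a if i in kept else 0.0 for i, a in enumerate(activations)]


# Hypothetical per-projection budgets: a non-uniform allocation would give
# each projection its own ratio instead of one global value.
budgets = {"q_proj": 0.5, "up_proj": 0.75}
acts = [0.1, -2.0, 0.5, 3.0]
sparse = {name: topk_sparsify(acts, ratio) for name, ratio in budgets.items()}
```

With these toy inputs, the projection given the larger budget keeps only its single largest-magnitude activation, which is the kind of projection-specific behaviour a non-uniform allocation is meant to allow.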

57. 【2603.12271】Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models

Link: https://arxiv.org/abs/2603.12271

Authors: Boyu Qiao, Sean Guo, Xian Yang, Kun Li, Wei Zhou, Songlin Hu, Yunya Song

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: revised multiple times, knowledge-intensive tasks, revised multiple, multiple times, Dynamic Knowledge Instance

Comments:

Click to view abstract

Abstract:LLMs are widely used in knowledge-intensive tasks where the same fact may be revised multiple times within context. Unlike prior work focusing on one-shot updates or single conflicts, multi-update scenarios contain multiple historically valid versions that compete at retrieval, yet remain underexplored. This challenge resembles the AB-AC interference paradigm in cognitive psychology: when the same cue A is successively associated with B and C, the old and new associations compete during retrieval, leading to bias. Inspired by this, we introduce a Dynamic Knowledge Instance (DKI) evaluation framework, modeling multi-updates of the same fact as a cue paired with a sequence of updated values, and assess models via endpoint probing of the earliest (initial) and latest (current) states. Across diverse LLMs, we observe that retrieval bias intensifies as updates increase: earliest-state accuracy stays high while latest-state accuracy drops substantially. Diagnostic analyses of attention, hidden-state similarity, and output logits further reveal that these signals become flatter and weakly discriminative on errors, providing little stable basis for identifying the latest update. Finally, cognitively inspired heuristic intervention strategies yield only modest gains and do not eliminate the bias. Our results reveal a persistent challenge in tracking and following knowledge updates in long contexts.

58. 【2603.12270】Task-Specific Knowledge Distillation via Intermediate Probes

Link: https://arxiv.org/abs/2603.12270

Authors: Ryan Brown, Chris Russell

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Knowledge distillation, Knowledge, method, teacher, high-quality training signal

Comments:

Click to view abstract

Abstract:Knowledge distillation from large language models (LLMs) assumes that the teacher's output distribution is a high-quality training signal. On reasoning tasks, this assumption is frequently violated. A model's intermediate representations may encode the correct answer, yet this information is lost or distorted through the vocabulary projection, where prompt formatting and answer-token choices create brittle, noisy outputs. We introduce a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states and using the probe's predictions, rather than output logits, as supervision for student training. This simple change yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced under limited data. Probes trained on intermediate representations provide cleaner labels than the teacher's own outputs, effectively denoising the distillation signal. The method requires no architectural changes to student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached. By exploiting internal representations, the method enables practitioners to extract more value from large teacher models without additional training data or architectural complexity.

59. 【2603.12642】Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

Link: https://arxiv.org/abs/2603.12642

Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David R. Mortensen, David Harwath

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: Transformer-based self-supervised speech, self-supervised speech models, entails remains unclear, Transformer-based self-supervised, speech models

Comments: Submitted to Interspeech 2026

Click to view abstract

Abstract:Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions, and emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.

Information Retrieval

1. 【2603.13168】Developing and evaluating a chatbot to support maternal health care

Link: https://arxiv.org/abs/2603.13168

Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: low health literacy, significant impact, access to care, ability to provide, provide trustworthy maternal

Comments: 17 pages; submitted to IJCAI 2026 AI and Social Good Track

Click to view abstract

Abstract:The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.

2. 【2603.13099】Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Link: https://arxiv.org/abs/2603.13099

Authors: Wayner Barrios, SouYoung Jin

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: evaluates multimodal reasoning, verifiable intermediate steps, instances that evaluates, evaluates multimodal, verifiable intermediate

Comments:

Click to view abstract

Abstract:We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.

3. 【2603.13017】Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Link: https://arxiv.org/abs/2603.13017

Authors: Sydney Lewis

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Long conversations, user conversation history, create a simple, simple problem, agent create

Comments: 6 figures. Code: https://github.com/Process-Point-Technologies-Corporation/searchat

Click to view abstract

Abstract:Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
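
For readers unfamiliar with the MRR figures quoted in this abstract, the following Python sketch shows how mean reciprocal rank is conventionally computed and how the "96% of the best verbatim MRR" figure follows from the two reported values. The function name and the toy rank list are hypothetical; the paper's own grading pipeline (consensus over LLM graders) is not modelled.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Standard MRR: average of 1/rank of the first relevant result.

    Each entry is the 1-based rank of the first relevant hit for one query,
    or None when no relevant result was returned (contributes 0).
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Toy example (hypothetical ranks for four queries).
toy_mrr = mean_reciprocal_rank([1, 2, None, 4])

# Ratio of the two MRR values reported in the abstract.
retained = 0.717 / 0.745
```

Rounded to two decimals, `retained` is 0.96, matching the 96% stated in the abstract.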

4. 【2603.12935】Can Fairness Be Prompted? Prompt-Based Debiasing Strategies in High-Stakes Recommendations

Link: https://arxiv.org/abs/2603.12935

Authors: Mihaela Rotar, Theresia Veronika Rampisela, Maria Maistro

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, potentially biasing recommendations, potentially biasing

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) can infer sensitive attributes such as gender or age from indirect cues like names and pronouns, potentially biasing recommendations. While several debiasing methods exist, they require access to the LLMs' weights, are computationally costly, and cannot be used by lay users. To address this gap, we investigate implicit biases in LLM Recommenders (LLMRecs) and explore whether prompt-based strategies can serve as a lightweight and easy-to-use debiasing approach. We contribute three bias-aware prompting strategies for LLMRecs. To our knowledge, this is the first study on prompt-based debiasing approaches in LLMRecs that focuses on group fairness for users. Our experiments with 3 LLMs, 4 prompt templates, 9 sensitive attribute values, and 2 datasets show that our proposed debiasing approach, which instructs an LLM to be fair, can improve fairness by up to 74% while retaining comparable effectiveness, but might overpromote specific demographic groups in some cases.

5. 【2603.12824】NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Link: https://arxiv.org/abs/2603.12824

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vision-Language Model, visual document retrieval, based retrievers, advanced visual document, retrievers have advanced

Comments:

Click to view abstract

Abstract:Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
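
As a rough illustration of what "pointwise cosine alignment on query text" could look like, here is a minimal Python sketch. It assumes the loss is one minus the cosine similarity between the student's and the cached teacher's embedding of the same query, averaged over a batch; this exact form, and all function names, are assumptions for illustration rather than the paper's confirmed objective.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_emb, teacher_emb):
    """Assumed form: 1 - cos(student(q), teacher(q)) for a single query q."""
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


def batch_alignment_loss(student_batch, teacher_batch):
    """Mean pointwise loss over a batch of (student, cached teacher) pairs.

    Only query embeddings are needed; no documents and no negatives appear,
    which is what distinguishes this from ranking or contrastive objectives.
    """
    if len(student_batch) != len(teacher_batch) or not student_batch:
        raise ValueError("batches must be non-empty and the same length")
    losses = [pointwise_cosine_loss(s, t)
              for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)
```

Because each query is matched only against its own teacher embedding, the teacher side can be computed once and cached, which is consistent with the abstract's note that no document processing is required during training.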

6. 【2603.12752】Taming the Long Tail: Efficient Item-wise Sharpness-Aware Minimization for LLM-based Recommender Systems

Link: https://arxiv.org/abs/2603.12752

Authors: Jiaming Zhang, Yuyuan Li, Xiaohua Feng, Li Zhang, Longfei Li, Jun Zhou, Chaochao Chen

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Model-based, Model-based Recommender Systems, Language Model-based Recommender, Large Language, Recommender Systems

Comments:

Click to view abstract

Abstract:Large Language Model-based Recommender Systems (LRSs) have recently emerged as a new paradigm in sequential recommendation by directly adopting LLMs as backbones. While LRSs demonstrate strong knowledge utilization and instruction-following abilities, they have not been systematically studied under the long-standing long-tail problem. In this paper, we conduct an empirical study and reveal that LRSs face two distinct types of long-tail: i) prior long-tail, inherited implicitly from pretraining corpora, and ii) data long-tail, originating from skewed recommendation datasets. Our analysis shows that both contribute to the performance disparity between head and tail items, with the intersection of the two heads exhibiting an even stronger head effect. Nevertheless, the overall performance distribution in LRSs, especially on the tail, remains dominated by the data long-tail. To address this challenge, we propose Efficient Item-wise Sharpness-Aware Minimization (EISAM), a novel optimization framework that improves tail-item performance by adaptively regularizing the loss landscape at the item level. EISAM introduces an efficient penalty design that captures fine-grained item-specific sharpness while maintaining computational scalability for LLMs. In addition, we derive a generalization bound for EISAM. Our theoretical analysis shows that the bound decreases at a faster rate under our item-wise regularization, offering theoretical support for its effectiveness. Extensive experiments on three real-world datasets demonstrate that EISAM significantly boosts tail-item recommendation performance while preserving overall quality, establishing the first systematic solution to the long-tail problem in LRSs.

7. 【2603.12726】Anchored Alignment: Preventing Positional Collapse in Multimodal Recommender Systems

Link: https://arxiv.org/abs/2603.12726

Authors: Yonghun Jeong, David Yoon Suk Kang, Yeon-Chang Lee

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Multimodal recommender systems, leverage images, enrich item representations, recommender systems, MMRS

Comments: 5 pages, 5 figures

Click to view abstract

Abstract:Multimodal recommender systems (MMRS) leverage images, text, and interaction signals to enrich item representations. However, recent alignment-based MMRSs that enforce a unified embedding space often blur modality-specific structures and exacerbate ID dominance. Therefore, we propose AnchorRec, a multimodal recommendation framework that performs indirect, anchor-based alignment in a lightweight projection domain. By decoupling alignment from representation learning, AnchorRec preserves each modality's native structure while maintaining cross-modal consistency and avoiding positional collapse. Experiments on four Amazon datasets show that AnchorRec achieves competitive top-N recommendation accuracy, while qualitative analyses demonstrate improved multimodal expressiveness and coherence. The codebase of AnchorRec is available at this https URL.

8. 【2603.12702】FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning

Link: https://arxiv.org/abs/2603.12702

Authors: Chaojie Sun, Bin Cao, Tiantian Li, Chenyu Hou, Ruizhe Li, Qing Fan

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, language models, growing efforts, rapid advancement, made on LLM-based

Comments: Under Review - Submitted to SIGIR 2026 Resources Track; 10 pages, 5 figures, 4 tables

Click to view abstract

Abstract:With the rapid advancement of large language models (LLMs), growing efforts have been made on LLM-based table retrieval. However, existing studies typically focus on single-table queries and implement them by similarity matching after encoding the entire table. These methods usually result in low accuracy due to their coarse-grained encoding, which incorporates much query-irrelevant data, and are also inefficient when dealing with large tables, failing to fully utilize the reasoning capabilities of LLMs. Further, multi-table queries are under-explored in retrieval tasks. To this end, we propose a hierarchical multi-table query method based on LLMs: Fine-Grained Multi-Table Retrieval (FGTR), a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD. Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F_2 metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.
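
The abstract does not define its F_2 metric. Assuming it denotes the standard F-beta score with beta = 2, which weights recall more heavily than precision, it can be computed as in the sketch below. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    With beta = 2, recall counts more than precision, which suits retrieval
    settings where missing a needed cell is worse than returning extras.
    Returns 0.0 when both precision and recall are 0.
    """
    if precision < 0 or recall < 0 or beta <= 0:
        raise ValueError("precision and recall must be >= 0 and beta > 0")
    denominator = beta * beta * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + beta * beta) * precision * recall / denominator


# Hypothetical sub-table: every needed cell is retrieved (recall 1.0),
# but half of the returned cells are irrelevant (precision 0.5).
score = f_beta(0.5, 1.0)
```

In this toy case the score is about 0.83, noticeably higher than the balanced F1 of about 0.67 for the same precision and recall, which shows the recall emphasis.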

9. 【2603.12625】VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

Link: https://arxiv.org/abs/2603.12625

Authors: Ty Valencia, Burak Barlas, Varun Singhal, Ruchir Bhatia, Wei Yang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: commonly framed, signals are combined, Multimodal recommendation, Multimodal, feature fusion problem

Comments: 13 pages, 4 figures, 1 table

Click to view abstract

Abstract:Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at this https URL.

10. 【2603.12608】InterDeepResearch: Enabling Human-Agent Collaborative Information Seeking through Interactive Deep Research

Link: https://arxiv.org/abs/2603.12608

Authors: Bo Pan, Lunke Pan, Yitao Zhou, Qi Jiang, Zhen Wen, Minfeng Zhu, Wei Chen

Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Keywords: massive-scale web sources, transformed complex information, iterative retrieval, web sources, agents have transformed

Comments:

Click to view abstract

Abstract:Deep research systems powered by LLM agents have transformed complex information seeking by automating the iterative retrieval, filtering, and synthesis of insights from massive-scale web sources. However, existing systems predominantly follow an autonomous "query-to-report" paradigm, limiting users to a passive role and failing to integrate their personal insights, contextual knowledge, and evolving research intents. This paper addresses the lack of human-in-the-loop collaboration in the agentic research process. Through a formative study, we identify that current systems hinder effective human-agent collaboration in terms of process observability, real-time steerability, and context navigation efficiency. Informed by these findings, we propose InterDeepResearch, an interactive deep research system backed by a dedicated research context management framework. The framework organizes research context into a hierarchical architecture with three levels (information, actions, and sessions), enabling dynamic context reduction to prevent LLM context exhaustion and cross-action backtracing for evidence provenance. Built upon this framework, the system interface integrates three coordinated views for visual sensemaking, and dedicated interaction mechanisms for interactive research context navigation. Evaluation on the Xbench-DeepSearch-v1 and Seal-0 benchmarks shows that InterDeepResearch achieves competitive performance compared to state-of-the-art deep research systems, while a formal user study demonstrates its effectiveness in supporting human-agent collaborative information seeking. Project page with system demo: this https URL.

11. 【2603.12586】Deferred is Better: A Framework for Multi-Granularity Deferred Interaction of Heterogeneous Features

Link: https://arxiv.org/abs/2603.12586

Authors: Yi Xu, Moyu Zhang, Chaofan Fan, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Click-through rate, prediction models estimates, vast feature space, features, estimates the probability

Comments:

Click to view abstract

Abstract:Click-through rate (CTR) prediction models estimate the probability of a user-item click by modeling interactions across a vast feature space. A fundamental yet often overlooked challenge is the inherent heterogeneity of these features: their sparsity and information content vary dramatically. For instance, categorical features like item IDs are extremely sparse, whereas numerical features like item price are relatively dense. Prevailing CTR models have largely ignored this heterogeneity, employing a uniform feature interaction strategy that inputs all features into the interaction layers simultaneously. This approach is suboptimal, as the premature introduction of low-information features can inject significant noise and mask the signals from information-rich features, which leads to model collapse and hinders the learning of robust representations. To address the above challenge, we propose a Multi-Granularity Information-Aware Deferred Interaction Network (MGDIN), which adaptively defers the introduction of features into the feature interaction process. MGDIN's core mechanism operates in two stages: First, it employs a multi-granularity feature grouping strategy to partition the raw features into distinct groups with more homogeneous information density in different granularities, thereby mitigating the effects of extreme individual feature sparsity and enabling the model to capture feature interactions from diverse perspectives. Second, a delayed interaction mechanism is implemented through a hierarchical masking strategy, which governs when and how each group participates by masking low-information groups in the early layers and progressively unmasking them as the network deepens. This deferred introduction allows the model to establish a robust understanding based on high-information features before gradually incorporating sparser information from other groups...

12. 【2603.12578】Bridging Sequential and Contextual Features with a Dual-View of Fine-grained Core-Behaviors and Global Interest-Distribution

链接https://arxiv.org/abs/2603.12578

作者:Yi Xu,Chaofan Fan,Moyu Zhang,Jinxin Hu,Jiahao Wang,Hao Zhang,Shizhun Wang,Yu Zhang,Xiaoyi Zeng

类目:Information Retrieval (cs.IR)

关键词:contextual features, tasks typically estimate, dynamically reflects real-time, reflects real-time shifts, Click-through rate

备注

点击查看摘要

Abstract:Click-through rate (CTR) prediction tasks typically estimate the probability of a user clicking on a candidate item by modeling both user behavior sequence features and the item's contextual features, where the user behavior sequence is particularly critical as it dynamically reflects real-time shifts in user interest. Traditional CTR models often aggregate this dynamic sequence into a single vector before interacting it with contextual features. This approach, however, not only leads to behavior information loss during aggregation but also severely limits the model's capacity to capture interactions between contextual features and specific user behaviors, ultimately impairing its ability to capture fine-grained behavioral details and hindering prediction accuracy. Conversely, a naive approach of directly interacting each user action with contextual features is computationally expensive and introduces significant noise from behaviors irrelevant to the candidate item. This noise tends to overwhelm the valuable signals arising from interactions involving behaviors that are more relevant to the candidate item. To resolve this issue, we propose a Core-Behaviors and Distributional-Compensation Dual-View Interaction Network (CDNet), which bridges the gap between sequential and contextual feature interactions from two complementary angles: a fine-grained interaction involving the most relevant behaviors and contextual features, and a coarse-grained interaction that models the user's overall interest distribution against the contextual features. By simultaneously capturing important behavioral details without forgoing the holistic user interest, CDNet effectively models the interplay between sequential and contextual features without imposing a significant computational burden. Extensive experiments validate the effectiveness of CDNet.

13. 【2603.12396】Test-Time Strategies for More Efficient and Accurate Agentic RAG

链接https://arxiv.org/abs/2603.12396

作者:Brian Zhang,Deepti Guntur,Zhiyang Zuo,Abhinav Sharma,Shreyas Chaudhari,Wenlong Zhao,Franck Dernoncourt,Puneet Mathur,Ryan Rossi,Nedim Lipka

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:systems face challenges, systems face, operates iteratively, address these complexities, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems face challenges with complex, multihop questions, and agentic frameworks such as Search-R1 (Jin et al., 2025), which operates iteratively, have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and challenges in contextualizing retrieved results effectively within the current generation prompt. Such issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline to mitigate these identified shortcomings. Specifically, we explore the integration of two components and their combination: a contextualization module to better integrate relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches using the HotpotQA (Yang et al., 2018) and the Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns. Our best-performing variant, utilizing GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
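
The de-duplication module described in this abstract (replace previously retrieved documents with the next most relevant ones) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the document IDs, and the assumption that the retriever returns a ranked ID list longer than `k` are all hypothetical.

```python
from typing import List, Set


def retrieve_without_repeats(
    ranked_doc_ids: List[str], seen: Set[str], k: int
) -> List[str]:
    """Return the top-k ranked documents that no earlier turn has returned."""
    # Skip documents already shown, so lower-ranked ones move up to fill the slots.
    fresh = [doc_id for doc_id in ranked_doc_ids if doc_id not in seen][:k]
    seen.update(fresh)  # remember them for later turns
    return fresh


# Hypothetical ranked lists for two consecutive search turns.
seen_docs: Set[str] = set()
turn1 = retrieve_without_repeats(["d1", "d2", "d3", "d4"], seen_docs, k=2)
turn2 = retrieve_without_repeats(["d2", "d1", "d5", "d3"], seen_docs, k=2)
print(turn1, turn2)
```

In the second turn the two top-ranked documents were already retrieved, so the next most relevant unseen documents take their place.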

14. 【2603.12368】Multi-Step Semantic Reasoning in Generative Retrieval

链接https://arxiv.org/abs/2603.12368

作者:Steven Dong,Yubao Tang,Maarten de Rijke

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Generative retrieval, relevant document identifiers, document identifiers directly, generate relevant document, encode a corpus

备注: Accepted at ECIR2026

点击查看摘要

Abstract:Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve the learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.

15. 【2603.12290】Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning

链接https://arxiv.org/abs/2603.12290

作者:Huidong Wu,Haojia Xiang,Jingtong Gao,Xiangyu Zhao,Dengsheng Wu,Jianping Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Scholarly web, Scholarly, miscitation, Learning-based Miscitation Detector, miscitation detection

备注

点击查看摘要

Abstract:The scholarly web is a vast network of knowledge connected by citations. However, this system is increasingly compromised by miscitation, where references do not support or even contradict the claims they are cited for. Current miscitation detection methods, which primarily rely on semantic similarity or network anomalies, struggle to capture the nuanced relationship between a citation's context and its place in the wider network. While large language models (LLMs) offer powerful capabilities in semantic reasoning for this task, their deployment is hindered by hallucination risks and high computational costs. In this work, we introduce LLM-Augmented Graph Learning-based Miscitation Detector (LAGMiD), a novel framework that leverages LLMs for deep semantic reasoning over citation graphs and distills this knowledge into graph neural networks (GNNs) for efficient and scalable miscitation detection. Specifically, LAGMiD introduces an evidence-chain reasoning mechanism, which uses chain-of-thought prompting to perform multi-hop citation tracing and assess semantic fidelity. To reduce LLM inference costs, we design a knowledge distillation method aligning GNN embeddings with intermediate LLM reasoning states. A collaborative learning strategy further routes complex cases to the LLM while optimizing the GNN for structure-based generalization. Experiments on three real-world benchmarks show that LAGMiD achieves state-of-the-art miscitation detection with significantly reduced inference cost.

16. 【2603.12282】Algorithmic Trust and Compliance: Benchmarking Brand Notability for UK iGaming Entities in Generative Search Engines

链接https://arxiv.org/abs/2603.12282

作者:Julen Oruesagasti

类目:Information Retrieval (cs.IR)

关键词:reshaping information retrieval, fundamentally reshaping information, Generative Engine Optimization, Search Engine Optimization, AI-powered search engines

备注: Technical Report. Produced by Interamplify Research Division (UK)

点击查看摘要

Abstract:The rapid adoption of generative AI-powered search engines, such as ChatGPT, Perplexity, and Gemini, is fundamentally reshaping information retrieval. We are witnessing a critical shift from traditional ranked lists to synthesized, citation-backed answers. This paradigm shift challenges established Search Engine Optimization (SEO) practices and necessitates a new framework, termed Generative Engine Optimization (GEO). In highly regulated environments like the UK iGaming sector, visibility is no longer dictated by keyword density, but by an entity's ability to project "Algorithmic Trust". This report presents an empirical analysis of how compliance signals -- such as UK Gambling Commission (UKGC) standards -- function as authority multipliers for Large Language Models (LLMs) when properly structured. Recent large-scale experiments reveal that AI Search exhibits a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned content. Consequently, practitioners must engineer their content for machine scannability and justification to dominate these new AI-perceived authority metrics.

计算机视觉

1. 【2603.13228】PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

链接https://arxiv.org/abs/2603.13228

作者:Yangsong Zhang,Anujith Muraleedharan,Rikhat Akizhanov,Abdul Ahad Butt,Gül Varol,Pascal Fua,Fabio Pizzati,Ivan Laptev

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:human motion data, human motion generation, text-conditioned human motion, large-scale human motion, text-conditioned human

备注

点击查看摘要

Abstract:Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may expose substantial deviations from original motion. To address this issue, we here propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate WBC into our training pipeline and optimize diffusion model such that the output of WBC becomes compliant both with physics and original text instructions. To train PhysMoDPO we deploy physics-based and task-specific rewards and use them to assign preference to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and for real-world deployment on a G1 humanoid robot.

2. 【2603.13227】Representation Learning for Spatiotemporal Physical Systems

链接https://arxiv.org/abs/2603.13227

作者:Helen Qu,Rudy Morel,Michael McCabe,Alberto Bietti,François Lanusse,Shirley Ho,Yann LeCun

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Machine learning approaches, evolution in time, Machine learning, approaches to spatiotemporal, primarily focused

备注: Published at ICLR 2026 Workshop on AI PDE

点击查看摘要

Abstract:Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system's evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system's governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at this https URL.

3. 【2603.13224】Visual-ERM: Reward Modeling for Visual Equivalence

链接https://arxiv.org/abs/2603.13224

作者:Ziyu Liu,Shengyuan Ding,Xinyu Fang,Xuanlang Dai,Penghui Yang,Jianze Liang,Jiaqi Wang,Kai Chen,Dahua Lin,Yuhang Zang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision Language, Vision Language Models, high visual fidelity, recent Large Vision, representations with high

备注: Project: [this https URL](https://github.com/InternLM/Visual-ERM)

点击查看摘要

Abstract:Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

4. 【2603.13215】Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

链接https://arxiv.org/abs/2603.13215

作者:Ziqi Ma,Mengzhan Liufu,Georgia Gkioxari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video world models, Video world, world models, ice melting, world

备注: [this https URL](https://glab-caltech.github.io/STEVOBench/)

点击查看摘要

Abstract:Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: this https URL. Blog: this https URL

5. 【2603.13185】Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

链接https://arxiv.org/abs/2603.13185

作者:Rohith Peddi,Saurabh,Shravan Shanmugam,Likhitha Pallapothula,Yu Xiang,Parag Singla,Vibhav Gogate

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain fundamentally frame-centric, Spatio-temporal scene graphs, evolving object interactions, World Scene Graph, existing methods remain

备注: [this https URL](https://github.com/rohithpeddi/WorldSGG)

点击查看摘要

Abstract:Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

6. 【2603.13182】Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

链接https://arxiv.org/abs/2603.13182

作者:Hiba Adil Al-kharsan,Róbert Rajkó

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:magnetic resonance imaging, computer-assisted diagnosis systems, resonance imaging, plays a sensitive, Brain tumor classification

备注: 30 pages, 29 figures

点击查看摘要

Abstract:Brain tumor classification from magnetic resonance imaging (MRI) plays a critical role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. Initially, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced: a forward noising step followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under powerful adversarial attacks created by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial attacks. Our findings suggest that combining interpretable NNMF-based representations with a lightweight deep approach and a diffusion-based defense technique provides an effective and reliable solution for medical image classification under adversarial conditions.

7. 【2603.13176】Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

链接https://arxiv.org/abs/2603.13176

作者:Dingcheng Huang,Xiaotong Zhang,Kamal Youcef-Toumi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:jointly extract visual, human agents intelligently, modules jointly extract, modern human-robot collaboration, achieve comprehensive scene

备注: Accepted to ICRA 2026

点击查看摘要

Abstract:In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.

8. 【2603.13163】Towards Faithful Multimodal Concept Bottleneck Models

链接https://arxiv.org/abs/2603.13163

作者:Pierre Moreau,Emeline Pineau Ferrand,Yann Choho,Benjamin Wong,Annabelle Blangero,Milan Bhan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Concept Bottleneck Models, Bottleneck Models, interpretable models, Concept Bottleneck, layer of human-interpretable

备注

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.

9. 【2603.13121】FDeID-Toolbox: Face De-Identification Toolbox

链接https://arxiv.org/abs/2603.13121

作者:Hui Wei,Hao Yu,Guoying Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词

备注: Technical Report. Codebase: [this https URL](https://github.com/infraface/FDeID-Toolbox)

点击查看摘要

None

10. 【2603.13119】Geometry-Guided Camera Motion Understanding in VideoLLMs

链接https://arxiv.org/abs/2603.13119

作者:Haoan Feng,Sri Harsha Musunuri,Guan-Ming Su

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:shapes visual perception, current video-capable vision-language, fundamental geometric signal, video-capable vision-language models

备注: 10 pages, 7 figures, supplementary included

点击查看摘要

Abstract:Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of **benchmarking**, **diagnosis**, and **injection**. We curate **CameraMotionDataset**, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, **CameraMotionVQA**. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at this https URL.

11. 【2603.13118】NOIR: Neural Operator mapping for Implicit Representations

链接https://arxiv.org/abs/2603.13118

作者:Sidaty El Hadramy,Nazim Haouchine,Michael Wehrli,Philippe C. Cattin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:continuous function spaces, grid-based deep learning, paper presents NOIR, reframes core medical, core medical imaging

备注

点击查看摘要

Abstract:This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: this https URL.

12. 【2603.13108】Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

链接https://arxiv.org/abs/2603.13108

作者:Guoqiang Zhao,Zhe Yang,Sheng Wu,Fei Teng,Mengfei Duan,Yuanfan Zheng,Kai Luo,Kailun Yang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:imagery provides holistic, panoramic multimodal occupancy, quadruped robots, panoramic multimodal, Vertical Jitter Compensation

备注: The dataset and code will be publicly released at [this https URL](https://github.com/SXDR/PanoMMOcc)

点击查看摘要

Abstract:Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at this https URL, along with the calibration tools released at this https URL.

13. 【2603.13102】BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

链接https://arxiv.org/abs/2603.13102

作者:Matteo Ballegeer,Dries F. Benoit

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:CAD designs early, required effort, CAD designs, designs early, manufacturing process selection

备注

点击查看摘要

Abstract:Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.

14. 【2603.13099】Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Link: https://arxiv.org/abs/2603.13099

Authors: Wayner Barrios, SouYoung Jin

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: evaluates multimodal reasoning, verifiable intermediate steps, instances that evaluates, evaluates multimodal, verifiable intermediate

Comments

Abstract:We introduce **CRYSTAL** (*__C__lear __R__easoning via __Y__ielded __S__teps, __T__raceability and __L__ogic*), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
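
As a rough illustration of how step-level metrics of this kind can be computed, here is a minimal Python sketch. It is not the benchmark's implementation: the token-overlap `jaccard` similarity (a stand-in for a semantic similarity model), the 0.5 threshold, the greedy one-to-one matching, and the use of a longest increasing subsequence to penalize disorder are all assumptions made for this example, and every function name is hypothetical.

```python
import re
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: overlap of lowercase token sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_steps(
    pred: Sequence[str],
    ref: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching; returns (pred_index, ref_index) pairs."""
    candidates = sorted(
        ((sim(p, r), i, j) for i, p in enumerate(pred) for j, r in enumerate(ref)),
        reverse=True,
    )
    used_pred, used_ref, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are even less similar
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
            pairs.append((i, j))
    return sorted(pairs)  # sorted by position in the predicted chain


def f1(n_matched: int, n_pred: int, n_ref: int) -> float:
    """Harmonic mean of step precision and step recall."""
    if n_matched == 0:
        return 0.0
    precision = n_matched / n_pred
    recall = n_matched / n_ref
    return 2 * precision * recall / (precision + recall)


def longest_increasing(seq: Sequence[int]) -> int:
    """Length of the longest strictly increasing subsequence (O(n^2))."""
    best: List[int] = []
    for k, x in enumerate(seq):
        best.append(1 + max((best[m] for m in range(k) if seq[m] < x), default=0))
    return max(best, default=0)


def match_f1(pred: Sequence[str], ref: Sequence[str], **kw) -> float:
    """Order-insensitive score: every matched step counts."""
    return f1(len(match_steps(pred, ref, **kw)), len(pred), len(ref))


def ordered_match_f1(pred: Sequence[str], ref: Sequence[str], **kw) -> float:
    """Order-sensitive score: only matched steps that appear in reference order count."""
    pairs = match_steps(pred, ref, **kw)
    in_order = longest_increasing([j for _, j in pairs])
    return f1(in_order, len(pred), len(ref))
```

For a two-step prediction that covers two of three reference steps but in reversed order, this toy version gives a Match F1 of 0.8 and an ordered score of 0.4, which shows how an order-sensitive variant can separate chains that plain matching treats alike.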

15. 【2603.13098】SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Link: https://arxiv.org/abs/2603.13098

Authors: Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale dataset comprising, semantic-driven CAD modeling, training and fine-tuning, CAD modeling, CAD

Comments: Accepted by ICRA 2026

Abstract:We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

16. 【2603.13091】Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Link: https://arxiv.org/abs/2603.13091

Authors: Seunghwan Bang, Hwanjun Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied agents increases, existing benchmarks largely, benchmarks largely emphasize, spatiotemporal video understanding, largely emphasize extractive

Comments: 35 pages, 8 figures, 21 tables

Abstract:The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.

17. 【2603.13089】V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Link: https://arxiv.org/abs/2603.13089

Authors: Shenghe Zheng, Junpeng Jiang, Wenbo Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: internalize rich structural, Large-scale video generative, Large-scale video, rich structural, trained on vast

Comments: Transfer the prior knowledge of video generative models to image restoration tasks

Abstract:Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.

18. 【2603.13085】Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

Link: https://arxiv.org/abs/2603.13085

Authors: Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Keywords: remains challenging due, mechanisms remains challenging, attention mechanisms remains, remains challenging, challenging due

Comments

Abstract:Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with exact correspondence to a data-dependent Gram-induced kernel, both empirical and theoretical analysis through the Neural Tangent Kernel (NTK) framework shows that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6--9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.

19. 【2603.13082】InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Link: https://arxiv.org/abs/2603.13082

Authors: Yebin Yang, Di Wen, Lei Qi, Weitong Kong, Junwei Zheng, Ruiping Liu, Yufan Chen, Chengzhi Wu, Kailun Yang, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Kunyu Peng

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: limited paired data, Text-guided Multi-human Motion, motion editing, Multi-human Motion Editing, single-person scenarios

Comments: The dataset and code will be released at [this https URL](https://github.com/YNG916/InterEdit)

Abstract:Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at this https URL.

20. 【2603.13077】Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

Link: https://arxiv.org/abs/2603.13077

Authors: Yihang Zhou, Chao Lin, Hideki Kikumoto, Ryozo Ooka, Sibo Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-time rooftop wind-speed, air mobility systems, urban air mobility, wind control systems, rooftop wind-speed distribution

Comments

Abstract:Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
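
For readers unfamiliar with the validation metrics named in this abstract, the sketch below computes FAC2, NMSE, and MG in plain Python. It assumes the definitions commonly used in dispersion and wind-field model evaluation (fraction of predictions within a factor of two, normalized mean square error, geometric mean bias); the paper may use variants, and the function names and sample values are hypothetical.

```python
import math
from statistics import fmean
from typing import Sequence


def fac2(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Fraction of points with 0.5 <= pred/obs <= 2 (1.0 is ideal)."""
    inside = sum(1 for o, p in zip(obs, pred) if o > 0 and 0.5 <= p / o <= 2.0)
    return inside / len(obs)


def nmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error divided by mean(obs) * mean(pred) (0.0 is ideal)."""
    mse = fmean((o - p) ** 2 for o, p in zip(obs, pred))
    return mse / (fmean(obs) * fmean(pred))


def mg(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Geometric mean bias, exp(mean(ln obs) - mean(ln pred)) (1.0 is ideal).

    Requires strictly positive values, e.g. wind-speed magnitudes.
    """
    return math.exp(fmean(math.log(o) for o in obs) - fmean(math.log(p) for p in pred))


# Illustrative values only, not data from the paper.
observed = [1.0, 2.0, 4.0]
predicted = [1.0, 1.0, 10.0]
print(fac2(observed, predicted), nmse(observed, predicted), mg(observed, predicted))
```

Because FAC2 and MG work on ratios while NMSE works on squared differences, the three can disagree about which model is better, which is consistent with the metric-dependent trade-offs the abstract mentions.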

21. 【2603.13070】Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Link: https://arxiv.org/abs/2603.13070

Authors: Yunzhuo Chen, Jordan Vice, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reproduce training images, produce impressive visuals, diffusion models, creating copyright, privacy risks

Comments

Abstract:State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.

22. 【2603.13069】Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems

Link: https://arxiv.org/abs/2603.13069

Authors: Ann Dooms

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Dynamical Systems (math.DS)

Keywords: Partitioned Iterated Function, Iterated Function System, denoising diffusion model, diffusion model, Partitioned Iterated

Comments

Abstract:What is a diffusion model actually doing when it turns noise into a photograph? We show that the deterministic DDIM reverse chain operates as a Partitioned Iterated Function System (PIFS) and that this framework serves as a unified design language for denoising diffusion model schedules, architectures, and training objectives. From the PIFS structure we derive three computable geometric quantities: a per-step contraction threshold $L^*_t$, a diagonal expansion function $f_t(\lambda)$ and a global expansion threshold $\lambda^{**}$. These quantities require no model evaluation and fully characterize the denoising dynamics. They structurally explain the two-regime behavior of diffusion models: global context assembly at high noise via diffuse cross-patch attention and fine-detail synthesis at low noise via patch-by-patch suppression release in strict variance order. Self-attention emerges as the natural primitive for PIFS contraction. The Kaplan-Yorke dimension of the PIFS attractor is determined analytically through a discrete Moran equation on the Lyapunov spectrum. Through the study of the fractal geometry of the PIFS, we derive three optimal design criteria and show that four prominent empirical design choices (the cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling) each arise as approximate solutions to our explicit geometric optimization problems tuning theory into practice.
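
The Kaplan-Yorke dimension mentioned above has a standard textbook definition in terms of a Lyapunov spectrum, sketched below in Python. This is only that general formula, not the paper's analytical derivation via a discrete Moran equation; the function name is hypothetical, and the sample spectrum is the set of values commonly quoted for the Lorenz system, used purely as an illustration.

```python
from typing import Iterable


def kaplan_yorke_dimension(exponents: Iterable[float]) -> float:
    """Kaplan-Yorke dimension D = j + (l_1 + ... + l_j) / |l_{j+1}|.

    Exponents are sorted in descending order and j is the largest index
    for which the partial sum l_1 + ... + l_j is still non-negative.
    """
    lams = sorted(exponents, reverse=True)
    total = 0.0
    for j, lam in enumerate(lams):
        if total + lam < 0:
            # Adding this exponent makes the sum negative: interpolate and stop.
            return j + total / abs(lam)
        total += lam
    # The partial sums never turn negative: the dimension saturates.
    return float(len(lams))


# Illustrative spectrum (commonly quoted Lorenz-system values).
print(kaplan_yorke_dimension([0.906, 0.0, -14.572]))
```

A spectrum with only negative exponents gives dimension 0 (a fixed point), while a strongly contracting direction following a weakly expanding one gives a fractional value slightly above an integer.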

23. 【2603.13057】Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Link: https://arxiv.org/abs/2603.13057

Authors: Yuki Hirakawa, Takashi Wada, Ryotaro Shimizu, Takuya Furusawa, Yuki Saito, Ryosuke Araki, Tianwei Chen, Fan Mo, Yoshimitsu Aoki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Fréchet Inception Distance, Kernel Inception Distance, image-based Virtual Try-ON, Virtual Try-ON, Inception Distance

Comments

Abstract:Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.

24. 【2603.13056】Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

Link: https://arxiv.org/abs/2603.13056

Authors: Elena Ryumina (1), Maxim Markitantov (1), Alexandr Axyonov (1), Dmitry Ryumin (1), Mikhail Dolgushin (1), Denis Dresvyanskiy (2), Alexey Karpov (1 and 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, (2) ITMO University, St. Petersburg, Russia)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Continuous emotion recognition, challenging problem due, Continuous emotion, head pose, valence-arousal estimation ITW

Comments: 8 pages, 1 figure

Abstract:Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
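
The Concordance Correlation Coefficient reported here can be computed in a few lines. The sketch below follows the usual definition (twice the covariance divided by the sum of both variances and the squared mean difference, with population statistics); it is not taken from the authors' code, and the function name and sample values are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def concordance_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """CCC = 2*cov / (var_pred + var_gold + (mean_pred - mean_gold)^2)."""
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean((p - mean_p) ** 2 for p in pred)
    var_g = fmean((g - mean_g) ** 2 for g in gold)
    cov = fmean((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


# Illustrative valence annotations and predictions, not data from the paper.
gold = [0.1, 0.2, 0.3, 0.4]
shifted = [0.2, 0.3, 0.4, 0.5]  # perfectly correlated but biased by +0.1
print(concordance_cc(gold, gold), concordance_cc(shifted, gold))
```

Unlike Pearson correlation, which would be 1.0 for both cases, CCC drops to about 0.714 for the shifted predictions, because it also penalizes bias in mean and scale.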

25. 【2603.13054】Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Link: https://arxiv.org/abs/2603.13054

Authors: Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nerve fibers, blood vessels, road networks, correctness is crucial, crucial for tubular

Comments: 28 pages, 6 figures

Abstract:Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.

26. 【2603.13044】Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

Link: https://arxiv.org/abs/2603.13044

Authors: Vanessa Borst, Samuel Kounev

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: clinical decision support, decision support systems, fundamental component, component of computer-assisted, computer-assisted diagnosis

Comments: Under review, MICCAI 2026

Abstract:Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs out-perform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available at GitHub.

27. 【2603.13033】ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Link: https://arxiv.org/abs/2603.13033

Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: recent trend, trend in vision-language, embodied domains, vision-language models, spatial reasoning

Comments

Abstract:A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

28. 【2603.13032】Multimodal OCR: Parse Anything from Documents

Link: https://arxiv.org/abs/2603.13032

Authors: Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Multimodal OCR, unified textual representations, jointly parses text, OCR Arena Elo

Comments

Abstract:We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed this http URL, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate this http URL from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, this http URL achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at this https URL.

29. 【2603.13027】SortScrews: A Dataset and Baseline for Real-time Screw Classification

Link: https://arxiv.org/abs/2603.13027

Authors: Tianhao Fu, Bingxuan Yang, Juncheng Guo, Shrena Sribalan, Yucheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Automatic identification, industrial automation, inventory management, important for industrial, Automatic

Comments

Abstract:Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce $\textbf{SortScrews}$, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at $512\times512$ resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a well-explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at this https URL.

30. 【2603.13024】SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

Link: https://arxiv.org/abs/2603.13024

Authors: Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu, Mathias Unberath

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: address fundamental challenges, world model capable, Surgical Action World, generating realistic surgical, surgical world model

Notes: The manuscript is under review

Click to view abstract

Abstract:A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) -- a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.

31. 【2603.12998】A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Link: https://arxiv.org/abs/2603.12998

Authors: Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inherit social biases, achieved remarkable performance, recent studies, achieved remarkable

Notes

Click to view abstract

Abstract:While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

32. 【2603.12997】Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis

Link: https://arxiv.org/abs/2603.12997

Authors: Chen Feng, Zhuo Zhi, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, Ioannis Patras

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: optimal clean-data classifier, Statistically consistent methods, theoretically grounded solution, Statistically consistent, consistent methods based

Notes: Accepted to CVPR2026

Click to view abstract

Abstract:Statistically consistent methods based on the noise transition matrix ($T$) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating $T$. The common assumption is that, given a perfect $T$, noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a $T$-estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.
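
To make the role of the transition matrix concrete, here is a minimal sketch of forward correction, one standard $T$-based technique (the abstract does not say which correction methods the paper tests, so this is an illustration only). It assumes `T[i][j]` is the probability that a sample of clean class `i` is observed with noisy label `j`; the 3-class symmetric-noise matrix and the probabilities are made-up values.

```python
import math


def forward_correct(clean_probs, T):
    """Map predicted clean-class probabilities to noisy-label probabilities.

    q_j = sum_i p_i * T[i][j], where T[i][j] = P(noisy = j | clean = i).
    """
    k = len(T)
    return [sum(clean_probs[i] * T[i][j] for i in range(k)) for j in range(k)]


def forward_loss(clean_probs, noisy_label, T):
    """Cross-entropy of the observed (noisy) label under the corrected prediction."""
    return -math.log(forward_correct(clean_probs, T)[noisy_label])


# Hypothetical oracle matrix: each label flips to another class with total probability 0.2.
T = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
p = [0.90, 0.05, 0.05]       # model's belief about the clean class
q = forward_correct(p, T)    # implied distribution over the noisy label
loss = forward_loss(p, 0, T)
```

Even with a perfect `T`, the loss is still computed against noisy labels, which is the setting the abstract examines.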

33. 【2603.12989】Test-Time Attention Purification for Backdoored Large Vision Language Models

Link: https://arxiv.org/abs/2603.12989

Authors: Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: adversaries insert trigger-embedded, strong multimodal performance, large vision-language models, insert trigger-embedded samples, large vision-language

Notes

Click to view abstract

Abstract:Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
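
As a rough illustration of the detect-then-prune idea, the sketch below works on a single attention row (one query token attending over all input tokens). The function names, the threshold of 2.0, and the top-k pruning rule are assumptions made for this example; the abstract does not specify CleanSight's layers, thresholds, or pruning criterion.

```python
def visual_text_ratio(attn, is_visual):
    """Ratio of attention mass on visual tokens to attention mass on text tokens."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    text = sum(a for a, v in zip(attn, is_visual) if not v)
    return visual / text if text > 0 else float("inf")


def purify(attn, is_visual, threshold=2.0, top_k=1):
    """Return the indices of tokens to keep.

    If the visual/text ratio exceeds the (hypothetical) threshold, the input is
    treated as suspicious and the top_k most-attended visual tokens are dropped.
    """
    all_tokens = list(range(len(attn)))
    if visual_text_ratio(attn, is_visual) <= threshold:
        return all_tokens
    visual_idx = [i for i in all_tokens if is_visual[i]]
    suspicious = set(sorted(visual_idx, key=lambda i: attn[i], reverse=True)[:top_k])
    return [i for i in all_tokens if i not in suspicious]


# Toy input: three visual tokens followed by two text tokens.
mask = [True, True, True, False, False]
poisoned = [0.05, 0.70, 0.05, 0.10, 0.10]   # one visual token "steals" attention
clean = [0.20, 0.20, 0.10, 0.25, 0.25]

kept_poisoned = purify(poisoned, mask)
kept_clean = purify(clean, mask)
```

In this toy case only the dominant visual token is removed from the poisoned input, while the clean input passes through unchanged.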

34. 【2603.12988】Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

Link: https://arxiv.org/abs/2603.12988

Authors: Aditya Parikh, Aasa Feragen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Fair Disease Diagnosis, lung disease diagnosis, Disease Diagnosis Challenge, multi-class lung disease, Squamous Cell Carcinoma

Notes

Click to view abstract

Abstract:We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories -- Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma -- with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at this https URL.
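
The competition metric described above (the average of per-gender macro F1 scores) can be sketched in a few lines. This is an illustrative implementation, not the challenge's official scorer: in particular, treating a class with no true or predicted samples in a gender group as F1 = 0 is an assumption, and the toy labels are made up.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class absent from both labels and predictions scores 0.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def per_gender_score(y_true, y_pred, genders, classes):
    """Average of the macro F1 computed separately within each gender group."""
    scores = []
    for g in sorted(set(genders)):
        idx = [i for i, x in enumerate(genders) if x == g]
        scores.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], classes)
        )
    return sum(scores) / len(scores)


# Toy example with two classes: perfect on group "F", one error on group "M".
y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]
genders = ["F", "F", "M", "M"]
score = per_gender_score(y_true, y_pred, genders, classes=[0, 1])
```

Because each gender group contributes equally regardless of its size, a model that performs well only on the majority group is penalized.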

35. 【2603.12976】SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated learning

Link: https://arxiv.org/abs/2603.12976

Authors: Md Anwar Hossen, Nathan R. Tallent, Luanzheng Guo, Ali Jannesary

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scientific discovery increasingly, Scientific discovery, discovery increasingly requires, extreme class imbalance, increasingly requires learning

Notes

Click to view abstract

Abstract:Scientific discovery increasingly requires learning on federated datasets, fed by streams from high-resolution instruments, that have extreme class imbalance. Current ML approaches either require impractical data aggregation or fail due to class imbalance. Existing coreset selection methods rely on local heuristics, making them unaware of the global data landscape and prone to sub-optimal and non-representative pruning. To overcome these challenges, we introduce SCOPE (Semantic Coreset using Orthogonal Projection Embeddings for Federated learning), a coreset framework for federated data that filters anomalies and adaptively prunes redundant data to mitigate long-tail skew. By analyzing the latent space distribution, we score each data point using a representation score that measures the reliability of core class features, a diversity score that quantifies the novelty of orthogonal residuals, and a boundary proximity score that indicates similarity to competing classes. Unlike prior methods, SCOPE shares only scalar metrics with a federated server to construct a global consensus, ensuring communication efficiency. Guided by the global consensus, SCOPE dynamically filters local noise and discards redundant samples to counteract global feature skews. Extensive experiments demonstrate that SCOPE yields competitive global accuracy and robust convergence, all while achieving exceptional efficiency with a 128x to 512x reduction in uplink bandwidth, a 7.72x wall-clock acceleration and reduced FLOP and VRAM footprints for local coreset selection.

36. 【2603.12938】Thinking in Streaming Video

Link: https://arxiv.org/abs/2603.12938

Authors: Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal agents operating, continuous video streams, Real-time understanding, dynamic environments, interactive assistants

Notes

Click to view abstract

Abstract:Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch--Think--Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at this https URL

37. 【2603.12937】SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization

Link: https://arxiv.org/abs/2603.12937

Authors: Tianwei Ye, Xiaoguang Mei, Yifan Xia, Fan Fan, Jun Huang, Jiayi Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Establishing accurate, critical challenge, remains a critical, Establishing, topological noise

Notes: 27 pages, 13 figures

Click to view abstract

Abstract:Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.

38. 【2603.12936】MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins

Link: https://arxiv.org/abs/2603.12936

Authors: WenBo Xu, Liu Liu, Li Zhang, Dan Guo, RuoNan Liu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: interactable articulated assets, Converting static, interactable articulated, crucial for embodied, Converting

Notes: 5 figures

Click to view abstract

Abstract:Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.

39. 【2603.12930】Rethinking VLMs for Image Forgery Detection and Localization

Link: https://arxiv.org/abs/2603.12930

Authors: Shaofeng Guo, Jiequan Cui, Richang Hong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Intelligence Generated Content, Artificial Intelligence Generated, Generated Content, posing significant challenges, Artificial Intelligence

Notes: 8 pages

Click to view abstract

Abstract:With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and this http URL is available at: this https URL.

40. 【2603.12918】VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Link: https://arxiv.org/abs/2603.12918

Authors: Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate global localization, Accurate global, driving and robotics, multipath effects, global localization

Notes: Accepted to CVPR 2026

Click to view abstract

Abstract:Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
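
The polar transformation mentioned above can be pictured with a small sketch: each output column corresponds to an azimuth angle around the image center, and each output row to a radius. The conventions here (angle 0 pointing up, top row at the largest radius, nearest-neighbour sampling, a square input) are assumptions for illustration and may differ from the mapping VIRD actually uses.

```python
import math


def polar_transform(sat, out_h, out_w):
    """Resample a square image (list of rows) into polar coordinates.

    Column j maps to azimuth 2*pi*j/out_w; row i maps to a radius that shrinks
    from the image border (top row) to the center (bottom row).
    Requires out_h >= 2.
    """
    size = len(sat)
    center = (size - 1) / 2.0
    out = []
    for i in range(out_h):
        radius = center * (out_h - 1 - i) / (out_h - 1)
        row = []
        for j in range(out_w):
            theta = 2.0 * math.pi * j / out_w
            y = center - radius * math.cos(theta)  # theta = 0 points "up"
            x = center + radius * math.sin(theta)
            row.append(sat[int(round(y))][int(round(x))])  # nearest neighbour
        out.append(row)
    return out


# Toy 5x5 "satellite image" whose pixel value encodes its position (row * 5 + col).
sat = [[r * 5 + c for c in range(5)] for r in range(5)]
polar = polar_transform(sat, out_h=3, out_w=4)
```

After this resampling, moving horizontally in the output corresponds to rotating around the center, which is what lets columns line up with the viewing directions of a ground-level panorama.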

41. 【2603.12915】Stake the Points: Structure-Faithful Instance Unlearning

Link: https://arxiv.org/abs/2603.12915

Authors: Kiseong Hong, JungKyoo Shin, Eunwoo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: addresses privacy risks, addresses privacy, Machine unlearning, privacy risks, risks in pretrained

Notes: Accepted by CVPR 2026

Click to view abstract

Abstract:Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

42. 【2603.12912】FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

Link: https://arxiv.org/abs/2603.12912

Authors: Xin Xu, Weilong Li, Wei Liu, Wenke Huang, Zhixi Yu, Bin Yang, Xiaoying Liao, Kui Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: learns domain-invariant representations, Distribution Aware Visual, Body Distribution Aware, Federated Domain Generalization, Aware Visual Prompt

Notes

Click to view abstract

Abstract:Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints -- a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at this https URL.

43. 【2603.12905】DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification

Link: https://arxiv.org/abs/2603.12905

Authors: Joana Reuss, Ekaterina Gikalo, Marco Körner

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant data scarcity, severe class imbalance, label acquisition costs, high label acquisition, Real-world agricultural monitoring

Notes: 20 pages, 9 Figures, 28 Tables

Click to view abstract

Abstract:Real-world agricultural monitoring is often hampered by severe class imbalance and high label acquisition costs, resulting in significant data scarcity. In few-shot learning (FSL), a framework specifically designed for data-scarce settings, training sets are often artificially balanced. However, this creates a disconnect from the long-tailed distributions observed in nature, leading to a distribution shift that undermines the model's ability to generalize to real-world agricultural tasks. We previously introduced Dirichlet Prior Augmentation (DirPA; Reuss et al., 2026a) to proactively mitigate the effects of such label distribution skews during model training. In this work, we extend the original study's geographical scope. Specifically, we evaluate this extended approach across multiple countries in the European Union (EU), moving beyond localized experiments to test the method's resilience across diverse agricultural environments. Our results demonstrate the effectiveness of DirPA across different geographical regions. We show that DirPA not only improves system robustness and stabilizes training under extreme long-tailed distributions, regardless of the target region, but also substantially improves individual class-specific performance by proactively simulating priors.

44. 【2603.12903】Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

Link: https://arxiv.org/abs/2603.12903

Authors: Yinuo Jiang, Jun Cheng, Yiran Wang, Cheng Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, shown remarkable success, LiDAR NVS

Notes: Accepted by CVPR 2026

Click to view abstract

Abstract:Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.

45. 【2603.12893】Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Link: https://arxiv.org/abs/2603.12893

Authors: David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Keywords: explicitly improve desirable, improve desirable aspects, post-training diffusion-based image, diffusion-based image synthesis, Reinforcement learning

Notes: Code available at [this https URL](https://github.com/NVlabs/finite-difference-flow-optimization)

Click to view abstract

Abstract:Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

46. 【2603.12887】Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

Link: https://arxiv.org/abs/2603.12887

Authors: Mingkai Zhai, Wei Wang, Zongsheng Li, Quanying Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Epileptic seizure forecasting, clinically important, important yet challenging, challenging problem, seizure forecasting

Notes

Click to view abstract

Abstract:Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
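
The forecasting task above (a short video segment predicts whether a seizure starts within the next 5 seconds) implies a simple labelling rule, sketched below. The window length, stride, and function names are hypothetical choices for illustration; how the authors actually sample segments, and how they treat segments that overlap a seizure, is not stated in the abstract.

```python
def label_segment(segment_end, onset_times, horizon=5.0):
    """True if any seizure onset falls within `horizon` seconds after the segment ends."""
    return any(segment_end < t <= segment_end + horizon for t in onset_times)


def make_windows(duration, onset_times, seg_len=5.0, stride=5.0, horizon=5.0):
    """Slide a fixed-length window over a recording and label each segment.

    Returns a list of (start, end, label) tuples, with times in seconds.
    """
    windows = []
    start = 0.0
    while start + seg_len <= duration:
        end = start + seg_len
        windows.append((start, end, label_segment(end, onset_times, horizon)))
        start += stride
    return windows


# Toy recording: 20 seconds long with a single seizure onset at t = 12 s.
windows = make_windows(duration=20.0, onset_times=[12.0])
labels = [label for _, _, label in windows]
```

Here only the segment covering 5 to 10 s is labelled positive. The segment covering 10 to 15 s contains the onset itself, so it is not pre-ictal and would likely be excluded in a real pipeline.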

47. 【2603.12886】A protocol for evaluating robustness to HE staining variation in computational pathology models

Link: https://arxiv.org/abs/2603.12886

Authors: Lydia A. Schönpflug, Nikki van den Berg, Sonali Andani, Nanda Horeweg, Jurriaan Barkey Wolf, Tjalling Bosse, Viktor H. Koelzer, Maxime W. Lafarge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deploying computational pathology, requiring systematic assessment, staining conditions, affects model prediction, staining variation remains

Notes

Click to view abstract

Abstract:Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models as hematoxylin and eosin (HE) staining varies across laboratories, requiring systematic assessment of how this variability affects model prediction. In this work, we developed a three-step protocol for evaluating robustness to HE staining variation in CPath models. Step 1: Select reference staining conditions, Step 2: Characterize test set staining properties, Step 3: Apply CPath model(s) under simulated reference staining conditions. Here, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness properties of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), including 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high HE intensity, low/high HE color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769-0.911 ($\Delta$ = 0.142). Robustness ranged from 0.007-0.079 ($\Delta$ = 0.072), and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across HE staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at this https URL .
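
The robustness measure described above (the min-max AUC range across simulated staining conditions) is simple to compute. The sketch below uses made-up AUC values and hypothetical model and condition names; it only shows how the range could feed into model selection and does not reproduce the paper's results.

```python
def auc_range(auc_by_condition):
    """Min-max AUC range across staining conditions (smaller means more robust)."""
    values = list(auc_by_condition.values())
    return max(values) - min(values)


def most_robust(models):
    """Name of the model with the smallest AUC range."""
    return min(models, key=lambda name: auc_range(models[name]))


# Hypothetical AUCs under the four simulated staining conditions.
models = {
    "model_a": {
        "low_intensity": 0.88, "high_intensity": 0.90,
        "low_similarity": 0.86, "high_similarity": 0.89,
    },
    "model_b": {
        "low_intensity": 0.84, "high_intensity": 0.85,
        "low_similarity": 0.84, "high_similarity": 0.85,
    },
}
ranges = {name: auc_range(aucs) for name, aucs in models.items()}
best = most_robust(models)
```

In this toy case `model_a` has the higher AUC but the wider range, so choosing between the two means trading peak performance against stability.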

48. 【2603.12873】TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking

Link: https://arxiv.org/abs/2603.12873

Authors: Jiale Meng,Jie Zhang,Runyi Hu,Zhe-Ming Lu,Tianwei Zhang,Yiming Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structure-aware framework leveraging, TRACE exploits character, localized character encoding, embed data, propose TRACE

Comments: none

Abstract:We propose TRACE, a structure-aware framework leveraging diffusion models for localized character encoding to embed data. Unlike existing methods that rely on edge features or pre-defined codebooks, TRACE exploits character structures that provide inherent resistance to noise interference due to their stability and unified representation across diverse characters. Our framework comprises three key components: (1) adaptive diffusion initialization that automatically identifies handle points, target points, and editing regions through specialized algorithms including movement probability estimator (MPE), target point estimation (TPE) and mask drawing model (MDM), (2) guided diffusion encoding for precise movement of selected points, and (3) masked region replacement with a specialized loss function to minimize feature alterations after the diffusion process. Comprehensive experiments demonstrate TRACE's superior performance over state-of-the-art methods, achieving more than 5 dB improvement in PSNR and 5% higher extraction accuracy following cross-media transmission. TRACE achieves broad generalizability across multiple languages and fonts, making it particularly suitable for practical document security applications.

49. 【2603.12864】Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

Link: https://arxiv.org/abs/2603.12864

Authors: Yifan Zhan,Zhengqing Chen,Qingjie Wang,Zhuo He,Muyao Niu,Xiaoyang Guo,Wei Yin,Weiqiang Ren,Qian Zhang,Yinqiang Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safety-critical edge cases, long tail, edge cases, major challenge, challenge in autonomous

Comments: none

Abstract:A major challenge in autonomous driving is the "long tail" of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis: systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average collision rate of 3s increases by 173%.

50. 【2603.12852】Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach

Link: https://arxiv.org/abs/2603.12852

Authors: Falko Kähler,Maxim Wille,Ole Schmedemann,Thorsten Schüppstuhl

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: free-form surfaces due, Abrasive flap wheels, finishing complex free-form, complex free-form surfaces, Abrasive flap

Comments: 14 pages, 11 figures, 8 tables

Abstract:Abrasive flap wheels are common for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
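
The three-level decomposition described above can be sketched as a simple decision cascade. This is only an illustration under stated assumptions: the classifier callables stand in for trained EfficientNetV2 models, the function and key names are hypothetical, and which wear types receive a severity stage is an assumption (the abstract only mentions concave severity explicitly).

```python
def classify_wear(image, state_clf, type_clf, tear_clf, severity_clfs):
    """Run a hierarchical wear classification cascade on one image.

    state_clf(image)     -> "new" or "worn"                        (level 1)
    type_clf(image)      -> "rectangular", "concave" or "convex"   (level 2)
    tear_clf(image)      -> True if a flap tear is detected        (level 2)
    severity_clfs[type]  -> callable returning "partial"/"complete" (level 3)
    """
    # Level 1: a new wheel needs no further analysis.
    if state_clf(image) == "new":
        return {"state": "new"}

    # Level 2: wear type and flap tear detection run on worn wheels only.
    result = {
        "state": "worn",
        "wear_type": type_clf(image),
        "flap_tear": tear_clf(image),
    }

    # Level 3: severity is assessed only for wear types that have a
    # dedicated severity classifier (an assumption of this sketch).
    severity_clf = severity_clfs.get(result["wear_type"])
    if severity_clf is not None:
        result["severity"] = severity_clf(image)
    return result
```

Compared with a single monolithic classifier, each stage here only has to separate a small number of classes, and later stages are skipped when an earlier decision makes them irrelevant.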

51. 【2603.12848】Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Link: https://arxiv.org/abs/2603.12848

Authors: Elena Ryumina(1),Alexandr Axyonov(1),Dmitry Sysoev(2),Timur Abdulkadirov(2),Kirill Almetov(2),Yulia Morozova(2),Dmitry Ryumin(1 and 2) ((1) St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia, (2) HSE University, St. Petersburg, Russia)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: challenging problem due, hesitancy recognition, behavioral state, ABAW Competition, unconstrained videos

Comments: 8 pages, 2 figures

Abstract:Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
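
The "statistical pooling" of frame-level embeddings mentioned above can be illustrated with a small sketch. The abstract does not say which statistics are used; concatenating the per-dimension mean and standard deviation is a common choice and is assumed here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def statistical_pooling(frame_embeddings):
    """Aggregate frame-level embeddings into one video-level vector.

    frame_embeddings: list of equal-length lists (one embedding per frame).
    Returns per-dimension means followed by per-dimension (population)
    standard deviations, so the output is twice the embedding size.
    """
    dims = list(zip(*frame_embeddings))  # transpose: one tuple per dimension
    means = [mean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    return means + stds
```

The pooled vector has a fixed size regardless of how many frames the video contains, which is what lets it be fused with the other modality embeddings.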

52. 【2603.12845】Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

Link: https://arxiv.org/abs/2603.12845

Authors: Fei Wang,Xinye Zheng,Kun Li,Yanyan Wei,Yuxin Liu,Ganpeng Hu,Tong Bao,Jingwen Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Predicting enzyme kinetic, defined biochemical conditions, Predicting enzyme, kinetic parameters quantifies, quantifies how efficiently

Comments: Accepted by CVPR 2026

Abstract:Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.
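
For context, the three parameters named above enter the textbook Michaelis-Menten rate law as shown below. This is standard enzyme kinetics rather than a formula taken from the paper: $[E]_0$ (total enzyme concentration), $[S]$ (substrate concentration), and $[I]$ (inhibitor concentration) are symbols introduced here, and the second relation assumes competitive inhibition.

$$
v = \frac{k_\text{cat}\,[E]_0\,[S]}{K_\text{m} + [S]}, \qquad
K_\text{m}^\text{app} = K_\text{m}\left(1 + \frac{[I]}{K_\text{i}}\right)
$$

A higher $k_\text{cat}$ raises the maximum rate, a lower $K_\text{m}$ means the enzyme reaches half its maximum rate at a lower substrate concentration, and a lower $K_\text{i}$ means a given inhibitor concentration inflates the apparent $K_\text{m}$ more strongly.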

53. 【2603.12832】Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning

Link: https://arxiv.org/abs/2603.12832

Authors: Fuhai Chen,Pengpeng Huang,Junwen Wu,Hehong Zhang,Shiping Wang,Xiaoguang Ma,Xuri Ge

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: UAV Scene Change, Scene Change Captioning, UAV scene, generate natural language, natural language descriptions

Comments: 20 pages, 10 figures

Abstract:This paper proposes a novel task for UAV scene understanding - UAV Scene Change Captioning (UAV-SCC) - which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a movable viewpoint. Unlike traditional change captioning that mainly describes differences between image pairs captured from a fixed camera viewpoint over time, UAV scene change captioning focuses on image-pair differences resulting from both temporal and spatial scene variations dynamically captured by a moving camera. The key challenge lies in understanding viewpoint-induced scene changes from UAV image pairs that share only partially overlapping scene content due to viewpoint shifts caused by camera rotation, while effectively exploiting the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. In particular, a novel transformer, i.e., the Dynamic Adaptive Layout Transformer (DALT), is designed to adaptively model diverse spatial layouts of the image pair, where the interrelated features derived from the overlapping and non-overlapping regions are learned within the flexible and unified encoding layer. Furthermore, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to enhance the model's sensitivity to viewpoint shift directions, enabling more accurate change captioning. To facilitate in-depth research on this task, we construct a new benchmark dataset, named UAV-SCC dataset, for UAV scene change captioning. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.

54. 【2603.12829】coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Link: https://arxiv.org/abs/2603.12829

Authors: Chunhan Li,Qifeng Wu,Jia-Hui Pan,Ka-Hei Hui,Jingyu Hu,Yuming Jiang,Bin Sheng,Xihui Liu,Wenjuan Gong,Zhengzhe Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faithfully composing multiple, composing multiple objects, advanced rapidly, complex scenes, models still struggle

Comments: Accepted to CVPR 2026 Findings

Abstract:Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG-Bench demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.

55. 【2603.12824】NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Link: https://arxiv.org/abs/2603.12824

Authors: Zhuchenyang Liu,Yao Zhang,Yu Xiao

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vision-Language Model, visual document retrieval, based retrievers, advanced visual document, retrievers have advanced

Comments: none

Abstract:Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
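
The "pointwise cosine alignment" objective named above can be sketched as follows: for each query, the student embedding is pulled toward the pre-cached teacher embedding of the same query text, with no documents or negatives involved. The exact loss form used in the paper is not given in the abstract; the `1 - cosine` formulation and all function names here are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(student_embs, teacher_embs):
    """Mean of (1 - cosine similarity) over paired query embeddings.

    student_embs[i] and teacher_embs[i] must embed the same query text;
    the teacher embeddings are assumed to be computed once and cached.
    """
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)
```

Because each term depends only on one query and its cached teacher vector, training needs no document encoding and no in-batch negatives, which matches the low training cost the abstract reports.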

56. 【2603.12823】Adaptive Vision-Language Model Routing for Computer Use Agents

Link: https://arxiv.org/abs/2603.12823

Authors: Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, User Interface, Graphical User, translate natural-language instructions, instructions into Graphical

Comments: none

Abstract:Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code are available at this https URL.
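
The threshold-based policy described above (send each action to the cheapest model whose predicted accuracy meets a reliability target) can be sketched in a few lines. The model names, costs, and predicted accuracies below are made up for illustration, and the fallback to the most accurate model when nothing meets the threshold is an assumption of this sketch, not something the abstract states.

```python
def route_action(models, threshold):
    """Return the name of the cheapest model meeting the reliability threshold.

    models: list of dicts with hypothetical keys "name", "cost", "pred_acc",
    where "pred_acc" is the predicted accuracy for the current action.
    """
    eligible = [m for m in models if m["pred_acc"] >= threshold]
    if eligible:
        return min(eligible, key=lambda m: m["cost"])["name"]
    # Assumed fallback: escalate to the model with the highest predicted accuracy.
    return max(models, key=lambda m: m["pred_acc"])["name"]


# Hypothetical model pool for one action (values are illustrative only).
pool = [
    {"name": "small-vlm", "cost": 1.0, "pred_acc": 0.72},
    {"name": "medium-vlm", "cost": 4.0, "pred_acc": 0.88},
    {"name": "large-vlm", "cost": 20.0, "pred_acc": 0.95},
]
```

With this pool, a threshold of 0.85 selects the medium model, while a lower threshold of 0.70 lets the small model handle the action; raising predicted accuracy for small models (for example through retrieved context) shifts more traffic to the cheap end.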

57. 【2603.12816】Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

Link: https://arxiv.org/abs/2603.12816

Authors: Gyutae Oh,Jungwoo Bae,Jitae Shin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continual learning, storing past data, domain-incremental learning, suffers from catastrophic, catastrophic forgetting

Comments: 29 pages, 10 figures

Abstract:Continual learning (CL) suffers from catastrophic forgetting, which is exacerbated in domain-incremental learning (DIL) where task identifiers are unavailable and storing past data is infeasible. While prompt-based CL (PCL) adapts representations with a frozen backbone, we observe that prompt-only improvements are often insufficient due to suboptimal prompt selection and classifier-level instability under domain shifts. We propose Residual SODAP, which jointly performs prompt-based representation adaptation and classifier-level knowledge preservation. Our framework combines $\alpha$-entmax sparse prompt selection with residual aggregation, data-free distillation with pseudo-feature replay, prompt-usage-based drift detection, and uncertainty-aware multi-loss balancing. Across three DIL benchmarks without task IDs or extra data storage, Residual SODAP achieves state-of-the-art AvgACC/AvgF of 0.850/0.047 (DR), 0.760/0.031 (Skin Cancer), and 0.995/0.003 (CORe50).
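
To illustrate what sparse prompt selection means in practice, the sketch below implements sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax. The value of $\alpha$ used in the paper is not stated in the abstract, so this is a stand-in for the general idea rather than the authors' configuration; the input scores are hypothetical query-to-prompt similarities.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so low-scoring
    prompts are dropped entirely instead of receiving a small weight.
    """
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumulative += z
        # The j largest scores stay in the support while this holds;
        # the last j that satisfies it determines the threshold tau.
        if 1.0 + j * z > cumulative:
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]
```

For similarity scores `[1.0, 0.8, -1.0]` the third prompt receives weight exactly zero and the remaining weights sum to one, so only the selected prompts contribute to the aggregated representation.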

58. 【2603.12811】OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2603.12811

Authors: Shijie Zhao,Xuanyu Zhang,Bin Chen,Weiqi Li,Qunliang Xing,Kexin Zhang,Yan Wang,Junlin Li,Li Zhang,Jian Zhang,Tianfan Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Aligning generative real-world, generative real-world image, real-world image super-resolution, image super-resolution models, human visual preference

Comments: Super-Resolution, Reinforcement Learning

Abstract:Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception-fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.

59. 【2603.12799】What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Link: https://arxiv.org/abs/2603.12799

Authors: Sen Nie,Jie Zhang,Zhongqi Wang,Zhaoyang Wei,Shiguang Shan,Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Achieving adversarial robustness, inevitably compromises accuracy, Achieving adversarial, inevitably compromises, presenting a long-standing

Comments: 28 pages

Abstract:Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at this https URL.

60. 【2603.12796】Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.12796

Authors: Yang Chen,Yi Yu,Jiaming He,Yueqi Duan,Zheng Zhu,Yap-Peng Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, Gaussian representation exposes, deliver high-quality rendering, Gaussian Splatting, deliver high-quality

Comments: none

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) deliver high-quality rendering, yet the Gaussian representation exposes a new attack surface, the resource-targeting attack. This attack poisons training images, excessively inducing Gaussian growth to cause resource exhaustion. Although efficiency-oriented methods such as smoothing, thresholding, and pruning have been explored, these spatial-domain strategies operate on visible structures but overlook how stealthy perturbations distort the underlying spectral behaviors of training data. As a result, poisoned inputs introduce abnormal high-frequency amplifications that mislead 3DGS into interpreting noisy patterns as detailed structures, ultimately causing unstable Gaussian overgrowth and degraded scene fidelity. To address this, we propose Spectral Defense in Gaussian and image fields. We first design a 3D frequency filter to selectively prune Gaussians exhibiting abnormally high frequencies. Since natural scenes also contain legitimate high-frequency structures, directly suppressing high frequencies is insufficient, and we further develop a 2D spectral regularization on renderings, distinguishing naturally isotropic frequencies while penalizing anisotropic angular energy to constrain noisy patterns. Experiments show that our defense builds robust, accurate, and secure 3DGS, suppressing overgrowth by up to $5.92\times$, reducing memory by up to $3.66\times$, and improving speed by up to $4.34\times$ under attacks.

61. 【2603.12793】Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Link: https://arxiv.org/abs/2603.12793

Authors: Yichen Zhang,Da Peng,Zonghao Guo,Zijian Zhang,Xuesong Yang,Tong Sun,Shichu Sun,Yidan Zhang,Yanghao Li,Haiyan Zhao,Wang Xu,Qi Shi,Yangang Sun,Chi Chen,Shuo Wang,Yukun Yan,Xu Han,Qiang Ma,Wei Ke,Liang Wang,Zhiyuan Liu,Maosong Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: recent cutting-edge topic, unify visual comprehension, recent cutting-edge, cutting-edge topic, gated detail residuals

Comments: 17 pages, 5 figures

Abstract:A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.

62. 【2603.12789】Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Link: https://arxiv.org/abs/2603.12789

Authors: Sangmin Kim,Minhyuk Hwang,Geonho Cha,Dongyoon Wee,Jaesik Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, foundation models, surrounding environments, models have led, led to growing

Comments: Project page: [this https URL](https://nstar1125.github.io/chromm)

Abstract:Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: this https URL.

63. 【2603.12788】Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

Link: https://arxiv.org/abs/2603.12788

Authors: Shuchang Lyu,Haiquan Wen,Guangliang Cheng,Meng Li,Zheng Zhou,You Zhou,Dingding Yao,Zhenwei Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significantly enhanced multi-step, Recent advances, multi-step reasoning capabilities, enhanced multi-step reasoning, remote sensing

Comments: 22 pages, 9 figures, 5 tables

Abstract:Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at this https URL.
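
The core of GRPO mentioned above is a group-relative advantage: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and spread. The sketch below shows only that normalization step. The reward values would come from the paper's entity-aware reward, which the abstract does not define, so the inputs here are hypothetical; the use of the population standard deviation and the `eps` stabilizer are also assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group.

    rewards: rewards of all responses sampled for the same prompt.
    Returns (r - group mean) / (group std + eps) for each response, so
    responses better than the group average get positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed; when every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.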

64. 【2603.12787】Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Link: https://arxiv.org/abs/2603.12787

Authors: Mengya Xu,Daiyun Shen,Jie Zhang,Hon Chi Yip,Yujia Gao,Cheng Chen,Dillan Imans,Yonghao Long,Yiru Ye,Yixiao Liu,Rongyun Mai,Kai Chen,Hongliang Ren,Yutong Ban,Guangsuo Wang,Francis Wong,Chi-Fai Ng,Kee Yuan Ngiam,Russell H. Taylor,Daguang Xu,Yueming Jin,Qi Dou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial intelligence, transform surgical practice, potential to transform, basic surgical actions, BSA

Comments: 34 pages, 8 figures

Abstract:Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialty performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons' evaluation of the language model's output of the action planning explainable texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.

65. 【2603.12773】Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Link: https://arxiv.org/abs/2603.12773

Authors: Guodong Fan,Shengning Zhou,Genji Yuan,Huiyu Li,Jingchun Zhou,Jinjiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: learning-based underwater image, recent years, learning-based underwater, techniques have rapidly, rapidly evolved

Comments: Accepted as an Oral presentation at AAAI 2026

Abstract:In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.

66. 【2603.12772】PVI: Plug-in Visual Injection for Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.12772

Authors: Zezhou Zhang,Songxin Zhang,Xiao Xiong,Junjie Zhang,Zejian Xie,Jingyi Xi,Zunyao Mao,Zan Mao,Zhixin Mai,Zhuoyang Song,Jiaxing Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: VLA architectures, flow-matching action expert, language-conditioned manipulation, action expert, architectures that pair

Comments

Abstract:VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.
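
The "zero-initialized residual pathway" mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`inject`, `zero_matrix`), the vector sizes, and the single linear projection are all assumptions chosen for illustration. The point it shows is only that when the injection projection starts at zero, the output is identical to the pretrained hidden state, so pretrained behavior is preserved at the start of fine-tuning.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def zero_matrix(rows: int, cols: int) -> Matrix:
    """Build a rows x cols projection filled with zeros (the zero initialization)."""
    return [[0.0 for _ in range(cols)] for _ in range(rows)]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def inject(hidden: Vector, aux: Vector, proj: Matrix) -> Vector:
    """Add projected auxiliary features to the hidden state as a residual.

    `hidden` stands in for an activation of the pretrained action expert and
    `aux` for auxiliary visual features; both are hypothetical placeholders.
    """
    delta = matvec(proj, aux)
    return [h + d for h, d in zip(hidden, delta)]


hidden = [1.0, -2.0, 0.5]
aux = [3.0, 4.0]
proj = zero_matrix(rows=3, cols=2)

# With a zero-initialized projection the residual contributes nothing.
at_init = inject(hidden, aux, proj)

# After training moves a weight away from zero, the auxiliary signal flows in.
proj[0][1] = 0.5
after_update = inject(hidden, aux, proj)
```

At initialization `at_init` equals `hidden`, and only after the projection weights are updated does the auxiliary input change the result.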

67. 【2603.12766】Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Link: https://arxiv.org/abs/2603.12766

Authors: Shifeng Chen,Yihui Li,Jun Liao,Hongyu Yang,Di Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, static scene editing, dynamic scene editing, scene editing, high-quality static scene

Comments: [this https URL](https://junliao2025.github.io/Catalyst4D-ProjectPage/)

Abstract:Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.

68. 【2603.12764】SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Link: https://arxiv.org/abs/2603.12764

Authors: Xiang Li,Heqian Qiu,Lanxiao Wang,Benliu Qiu,Fanman Meng,Linfeng Xu,Hongliang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: assembly quality control, Imitation Error Detection, Exo Imitation Error, Error detection, industrial training

Comments: This article was accepted by CVPR 2026

Abstract:Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at this https URL.

69. 【2603.12762】TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation

Link: https://arxiv.org/abs/2603.12762

Authors: Nazar Puriy,Johannes Jakubik,Benedikt Blumenstiel,Konrad Schindler

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Earth observation, Earth observation data, real-world Earth observation, approach to multimodal, Earth

Comments

Abstract:We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning based risk map prediction for natural disasters -- a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.

70. 【2603.12760】HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

Link: https://arxiv.org/abs/2603.12760

Authors: Xiaoyu Li,Yuhang Liu,Zheng Luo,Xuanshuo Kang,Fangqi Lou,Xiaohua Wu,Zihan Xiong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Multimodal Models, paradigm for Large, task adaptation, Large Multimodal, significant paradigm

Comments: Accepted to CVPR 2026. Code available at [this https URL](https://github.com/bbbandari/HiFICL)

Abstract:In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of "virtual key-value pairs" to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at this https URL.

71. 【2603.12759】SAP: Segment Any 4K Panorama

Link: https://arxiv.org/abs/2603.12759

Authors: Lutao Jiang,Zidong Cao,Weikai Chen,Xu Zheng,Yuanhuiyi Lyu,Zhenyang Li,Zeyu HU,Yingda Yin,Keyang Luo,Runze Zhang,Kai Yan,Shengju Qian,Haidi Fan,Yifan Peng,Xin Wang,Hui Xiong,Ying-Cong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Promptable instance segmentation, Promptable instance, widely adopted, adopted in embodied, imagery often degrades

Comments: Project Page: [this https URL](https://lutao2021.github.io/SAP_Page/)

Abstract:Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving a +17.2 zero-shot mIoU gain over vanilla SAM2 of different sizes on a real-world 4K panorama benchmark.

72. 【2603.12758】FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking

Link: https://arxiv.org/abs/2603.12758

Authors: Cheng Ju,Zejing Zhao,Akio Namiki

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Reliable multi-object tracking, Reliable multi-object, robotic systems operating, dynamic environments, systems operating

Comments

Abstract:Reliable multi-object tracking (MOT) is essential for robotic systems operating in complex and dynamic environments. Despite recent advances in detection and association, online MOT methods remain vulnerable to identity switches caused by frequent occlusions and object overlap, where incorrect associations can propagate over time and degrade tracking reliability. We present a lightweight post-association correction framework (FC-Track) for online MOT that explicitly targets overlap-induced mismatches during inference. The proposed method suppresses unreliable appearance updates under high-overlap conditions using an Intersection over Area (IoA)-based filtering strategy, and locally corrects detection-to-tracklet mismatches through appearance similarity comparison within overlapped tracklet pairs. By preventing short-term mismatches from propagating, our framework effectively mitigates long-term identity switches without resorting to global optimization or re-identification. The framework operates online without global optimization or re-identification, making it suitable for real-time robotic applications. We achieve 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set with a running speed of 5.7 FPS, and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on the MOT20 test set with a running speed of 0.6 FPS. Specifically, our framework FC-Track produces only 29.55% long-term identity switches, which is substantially lower than existing online trackers. Meanwhile, our framework maintains state-of-the-art performance on the MOT20 benchmark.
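
For readers unfamiliar with Intersection over Area (IoA), the following is a minimal sketch of how such an overlap filter could look. It is an illustration under stated assumptions, not the FC-Track implementation: the box format `(x1, y1, x2, y2)`, the function names, and the `0.5` threshold are all hypothetical, and the abstract does not specify how the paper defines or thresholds IoA.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_over_area(box: Box, other: Box) -> float:
    """Fraction of `box` covered by `other` (intersection divided by box's own area)."""
    overlap = (
        max(box[0], other[0]),
        max(box[1], other[1]),
        min(box[2], other[2]),
        min(box[3], other[3]),
    )
    area = box_area(box)
    return box_area(overlap) / area if area > 0 else 0.0


def should_update_appearance(
    box: Box, neighbours: Iterable[Box], ioa_threshold: float = 0.5
) -> bool:
    """Allow an appearance update only if no neighbour covers too much of `box`.

    The threshold value is a placeholder, not a setting taken from the paper.
    """
    return all(intersection_over_area(box, n) < ioa_threshold for n in neighbours)


target = (0.0, 0.0, 10.0, 10.0)
half_covering = (5.0, 0.0, 15.0, 10.0)
far_away = (20.0, 20.0, 30.0, 30.0)

ioa_half = intersection_over_area(target, half_covering)
update_when_occluded = should_update_appearance(target, [half_covering])
update_when_clear = should_update_appearance(target, [far_away])
```

Unlike IoU, IoA is asymmetric: it normalizes by the area of one box only, so a small box hidden behind a large one scores close to 1 even though their IoU may be low.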

73. 【2603.12751】Show, Don't Tell: Detecting Novel Objects by Watching Human Videos

Link: https://arxiv.org/abs/2603.12751

Authors: James Akl,Jose Nicolas Avendano Arbelaez,James Barabas,Jennifer L. Barry,Kalie Ching,Noam Eshed,Jiahui Fu,Michel Hidalgo,Andrew Hoelscher,Tushar Kusnur,Andrew Messing,Zachary Nagler,Brian Okorn,Mauro Passerino,Tim J. Perkins,Eric Rosen,Ankit Shah,Tanmay Shankar,Scott Shaw

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: objects, robot quickly identify, quickly identify, objects shown, Show

Comments

Abstract:How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.

74. 【2603.12749】SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Link: https://arxiv.org/abs/2603.12749

Authors: Zheng Gao,Yifan Yang,Xiaoyu Li,Xiaoyan Feng,Haoran Fan,Yang Song,Jiaojiao Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: content-independent noise patterns, diffusion models, models has emerged

Comments

Abstract:Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.

75. 【2603.12746】Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Link: https://arxiv.org/abs/2603.12746

Authors: Yuzhi Huang,Kairun Wen,Rongxin Gao,Dongxuan Liu,Yibin Lou,Jie Wu,Jing Xu,Jian Zhang,Zheng Yang,Yunlong Lin,Chenxin Li,Panwang Pan,Junbin Lu,Jingyan Jiang,Xinghao Ding,Yue Huang,Zhi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic content evolve, Multimodal Large Language, Humans inhabit, Large Language Models, current Multimodal Large

Comments

Abstract:Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at this https URL.

76. 【2603.12743】MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

Link: https://arxiv.org/abs/2603.12743

Authors: Chenyang Zhu,Hongxiang Li,Xiu Li,Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Knowledge-aware Concept Customization, Concept, target concept, Concept customization, Concept customization typically

Comments: Project Page: [this https URL](https://chenyangzhu1.github.io/MoKus/)

Abstract:Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.

77. 【2603.12722】CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment

Link: https://arxiv.org/abs/2603.12722

Authors: Kaifan Zhang,Lihuo He,Junjie Ke,Yuqi Ji,Lukun Wu,Lizi Wang,Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual stimuli reconstruction, EEG remains challenging, remains challenging due, Visual stimuli, EEG remains

Comments

Abstract:Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: this https URL.

78. 【2603.12721】CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration

Link: https://arxiv.org/abs/2603.12721

Authors: Dongxu Zhang,Yingsen Wang,Yiding Sun,Haoran Xu,Peilin Fan,Jihua Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Robust point cloud, augmented reality, Robust point, Hybrid Attention Network, computer vision

Comments

Abstract:Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model's robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model's generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code in this https URL.

79. 【2603.12719】IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration

Link: https://arxiv.org/abs/2603.12719

Authors: Dongxu Zhang,Jihua Zhu,Shiqi Li,Wenbiao Yan,Haoran Xu,Peilin Fan,Huimin Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: autonomous driving, environmental modeling, fundamental task, essential support, Point cloud registration

Comments

Abstract:Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available in this https URL.

80. 【2603.12718】The COTe score: A decomposable framework for evaluating Document Layout Analysis models

Link: https://arxiv.org/abs/2603.12718

Authors: Jonathan Bourne,Mwiza Simbeye,Ishtar Govia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Document Layout analysis, Document Layout, Layout analysis, machine learning models, meaningful elements

Comments: 6906 words, 4 Figures, 10 Tables

Abstract:Document Layout Analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content, and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to the F1. Notably, we find that the COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU labelled dataset and a Python library for applying COTe in DLA projects.

81. 【2603.12716】UNIStainNet: Foundation-Model-Guided Virtual Staining of HE to IHC

Link: https://arxiv.org/abs/2603.12716

Authors: Jillur Rahman Saurav,Thuong Le Hoai Pham,Pritam Mukherjee,Paul Yi,Brent A. Orr,Jacob M. Luber

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: preliminary molecular insight, molecular insight directly, Virtual immunohistochemistry, providing preliminary molecular, staining from hematoxylin

Comments

Abstract:Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (HE) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE-UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue-level semantic guidance for stain translation. A misalignment-aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state-of-the-art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per-stain models. On BCI, it also achieves the best distributional metrics. A tissue-type stratified failure analysis reveals that remaining errors are systematic, concentrating in non-tumor tissue. Code is available at this https URL.

82. 【2603.12711】Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

Link: https://arxiv.org/abs/2603.12711

Authors: Jing Yang,Hui Xue,Shipeng Zhu,Pengfei Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: studies unsupervised cross-domain, paper studies unsupervised, labeled data, unsupervised cross-domain image, studies unsupervised

Comments

Abstract:This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

83. 【2603.12708】HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation

Link: https://arxiv.org/abs/2603.12708

Authors: Pingping Zhang, Tianyu Yan, Yuhao Wang, Yang Liu, Tongdan Tang, Yili Ma, Long Lv, Feng Tian, Weibing Sun, Huchuan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segmenting marine animals, complex marine environments, Marine Animal Segmentation, aims at identifying, Animal Segmentation

Comments: Accepted by TIP2026. More modifications may be performed

Abstract:Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most previous deep learning-based MAS methods struggle with the long-distance modeling issue. Recently, the Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks the ability to perceive fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM), for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts, which are integrated into SAM's decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. The source code is publicly available at this https URL.

84. 【2603.12703】VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Link: https://arxiv.org/abs/2603.12703

Authors: Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video understanding requires, understanding requires models, Video understanding, update world state, continuously track

Comments:

Abstract:Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.

85. 【2603.12696】HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation

Link: https://arxiv.org/abs/2603.12696

Authors: Pingcong Li, Zihui Yu, Bichi Zhang, Sören Schwertfeger

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: goal-oriented autonomy, shifting from rigid, OpenStreetMap Area Graph, Area Graph, VLN

Comments:

Abstract:Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
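
The halt, invalidate, and replan cycle described in this abstract can be illustrated on a toy area graph. This is only a sketch, not the authors' implementation: the floorplan, the area names, and the helpers `shortest_route` and `invalidate_passage` are hypothetical, and plain breadth-first search stands in for whatever planner HaltNav actually runs over osmAG.

```python
from collections import deque

def shortest_route(graph, start, goal):
    """Breadth-first search over an undirected area graph; returns a list of areas or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        area = queue.popleft()
        if area == goal:
            route = []
            while area is not None:
                route.append(area)
                area = parents[area]
            return route[::-1]
        # Sorted iteration keeps the result deterministic for this toy example.
        for neighbour in sorted(graph.get(area, ())):
            if neighbour not in parents:
                parents[neighbour] = area
                queue.append(neighbour)
    return None

def invalidate_passage(graph, a, b):
    """Remove the passage between areas a and b (e.g. a closed door) in both directions."""
    graph[a].discard(b)
    graph[b].discard(a)

# Hypothetical floorplan with two ways from the lobby to the lab.
area_graph = {
    "lobby": {"corridor_a", "corridor_b"},
    "corridor_a": {"lobby", "lab"},
    "corridor_b": {"lobby", "storage"},
    "storage": {"corridor_b", "lab"},
    "lab": {"corridor_a", "storage"},
}

planned = shortest_route(area_graph, "lobby", "lab")
# Assume a halting signal reports that the corridor_a / lab passage is blocked.
invalidate_passage(area_graph, "corridor_a", "lab")
detour = shortest_route(area_graph, "lobby", "lab")
```

Here `planned` goes through `corridor_a`, while `detour` is forced through `corridor_b` and `storage` once the blocked passage has been removed from the graph.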

86. 【2603.12693】HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

Link: https://arxiv.org/abs/2603.12693

Authors: Andrey V. Savchenko, Kseniia Tsypliakova

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Affective Behavior Analysis, Affective Behavior, Behavior Analysis, article presents, Affective

Comments: to be submitted to ABAW-10 workshop of CVPR 2026

Abstract:This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model's confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
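
The two inference-time steps mentioned in this abstract, confidence gating and sliding-window smoothing, can be sketched as follows. The threshold of 0.9, the window of 3 frames, the centred window, and the function names are all assumptions made for illustration; the abstract does not give the values or the exact smoothing scheme.

```python
def gated_scores(pretrained_scores, fallback_fn, embedding, threshold=0.9):
    """Keep the pre-trained model's scores if its top confidence exceeds the
    threshold; otherwise defer to a fallback classifier (e.g. an MLP) on the embedding."""
    if max(pretrained_scores) > threshold:
        return list(pretrained_scores)
    return list(fallback_fn(embedding))

def smooth_scores(frame_scores, window=3):
    """Average class scores over a centred sliding window, truncated at the video edges."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_scores)):
        lo, hi = max(0, i - half), min(len(frame_scores), i + half + 1)
        chunk = frame_scores[lo:hi]
        n_classes = len(chunk[0])
        smoothed.append([sum(f[c] for f in chunk) / len(chunk) for c in range(n_classes)])
    return smoothed

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy two-class sequence in which the middle frame is a noisy outlier.
frames = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]
labels_raw = [argmax(f) for f in frames]
labels_smooth = [argmax(f) for f in smooth_scores(frames, window=3)]
```

In this toy sequence the raw labels flip on the middle frame, whereas the smoothed labels stay on class 0 throughout.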

87. 【2603.12690】CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images

Link: https://arxiv.org/abs/2603.12690

Authors: Liangzheng Sun, Mengfan He, Xingyu Shao, Binbin Li, Zhiqiang Yan, Chunyu Li, Ziyang Meng, Fei Xing

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-modality visual localization, navigation and perception, feature matching, visual localization, feature matching plays

Comments:

Abstract:Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluation. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated on different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resources will be available at: this https URL.

88. 【2603.12688】STRAP-ViT: Segregated Tokens with Randomized -- Transformations for Defense against Adversarial Patches in ViTs

Link: https://arxiv.org/abs/2603.12688

Authors: Nandish Chattopadhyay, Anadi Goyal, Chandan Karfa, Anupam Chattopadhyay

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: force confident misclassifications, physically realizable localized, realizable localized noise, hijack Vision Transformers, adversarial noise

Comments: Accepted for publication at IEEE/ACM Design Automation Conference (DAC) 2026

Abstract:Adversarial patches are physically realizable localized noise, which are able to hijack Vision Transformer (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens which correspond to the areas of the image that contain the adversarial noise have different statistical properties when compared to the tokens which do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then applies randomized composite transformations to them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter for the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within the ViT architectures, for inference purposes only, with a minimal computational cost, and does not require any additional training cost/effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines, and to outperform the state-of-the-art.
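
To make the detection idea concrete, the sketch below scores each token by its Jensen-Shannon divergence from a reference distribution and flags the most divergent ones. The abstract does not say how STRAP-ViT builds per-token distributions or which reference it compares against, so the histogram inputs, the mean-distribution reference, and the function names here are assumptions made for illustration only.

```python
import math

def normalise(values):
    """Turn a non-negative vector into a probability distribution."""
    total = sum(values)
    return [v / total for v in values]

def kl_divergence(p, q):
    """KL(p || q) in bits, skipping zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def flag_anomalous_tokens(token_histograms, k):
    """Return the indices of the k tokens that diverge most from the mean distribution."""
    dists = [normalise(h) for h in token_histograms]
    n = len(dists)
    mean = [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
    scores = [js_divergence(d, mean) for d in dists]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy per-token histograms: token 3 concentrates all its mass in one bin.
histograms = [[5, 5, 5, 5], [6, 4, 5, 5], [5, 5, 4, 6], [20, 0, 0, 0], [5, 6, 5, 4]]
suspects = flag_anomalous_tokens(histograms, k=1)
```

The value of `k` plays the role of the "minimum number of tokens to transform" hyper-parameter; a mitigation step would then apply its transformations only to the tokens listed in `suspects`.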

89. 【2603.12685】RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

Link: https://arxiv.org/abs/2603.12685

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: region guidance stage, RGB-T Salient Object, Salient Object Detection, Region-guided Selective Optimization, Selective Optimization Network

Comments:

Abstract:This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network (RSONet) for RGB-T Salient Object Detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and spatial-aware fusion (SF) module, are designed to generate guidance maps, which are used to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to optimize detail information. In addition, the mutual interaction semantic (MIS) module is placed in the high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.

90. 【2603.12680】G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Link: https://arxiv.org/abs/2603.12680

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: exhibit significant scale, Remote sensing images, significant scale variations, sensing images captured, complex backgrounds

Comments:

Abstract:Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.

91. 【2603.12669】Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Link: https://arxiv.org/abs/2603.12669

Authors: Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: works explore language-based, improve multi-model reasoning, explore language-based ensemble, growing number, works explore

Comments:

Abstract:With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at this https URL.
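
As background for the CKA-based metric, here is a minimal pure-Python sketch of standard linear CKA (centered kernel alignment) between two embedding matrices. It shows only the generic similarity measure; how the paper turns CKA into its CKA-focal diversity score is not described in the abstract, and the function names below are hypothetical.

```python
def _centre(rows):
    """Subtract the column mean from every feature column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def _cross_sq_norm(x, y):
    """Squared Frobenius norm of X^T Y for two matrices with one row per example."""
    total = 0.0
    for xcol in zip(*x):
        for ycol in zip(*y):
            total += sum(a * b for a, b in zip(xcol, ycol)) ** 2
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centred features.
    Undefined (division by zero) if either matrix has no variance."""
    xc, yc = _centre(x), _centre(y)
    denom = (_cross_sq_norm(xc, xc) ** 0.5) * (_cross_sq_norm(yc, yc) ** 0.5)
    return _cross_sq_norm(xc, yc) / denom

# Toy embeddings for four inputs from a hypothetical model A.
emb_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Model B: the same geometry, rotated by 90 degrees and scaled by 3.
emb_b = [[-3.0 * b, 3.0 * a] for a, b in emb_a]
similarity = linear_cka(emb_a, emb_b)
```

Linear CKA is invariant to rotation and isotropic scaling, so `similarity` is 1 up to rounding even though the raw coordinates differ. Values closer to 0 indicate that two models organise the same inputs differently.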

92. 【2603.12667】Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

Link: https://arxiv.org/abs/2603.12667

Authors: Haohang Huang, Jiayi Luo, Issam Qamhia, Erol Tutumluer, John M. Hart, Andrew J. Stolba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: important functional components, transportation infrastructures, main skeleton, skeleton in assemblies, important functional

Comments:

Abstract:Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g., pavement base and railroad ballast, in bound applications such as cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape, or morphology, of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights into aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics. Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.

93. 【2603.12663】Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization

Link: https://arxiv.org/abs/2603.12663

Authors: Kazuto Nakashima, Hojung Jung, Yuki Oto, Yumi Iwashita, Ryo Kurazume, Oscar Martinez Mozos

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Semantic place categorization, Semantic place, robots and vehicles, unfamiliar environments, essential tasks

Comments: Published in Advanced Robotics on 31 Jul 2018

Abstract:Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to have capabilities of self-decision and navigation in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness of using both depth and reflectance modalities. To analyze our trained deep networks, we visualize the learned features.

94. 【2603.12659】AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Link: https://arxiv.org/abs/2603.12659

Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited semantic coverage, imagery remains challenging, remains challenging due, Adapting vision-language models, sensing imagery remains

Comments: Accepted to CVPR 2026

Abstract:Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

95. 【2603.12657】VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

Link: https://arxiv.org/abs/2603.12657

Authors: Yuhang Ming, Tingkang Xi, Xingrui Yang, Lixin Yang, Yong Peng, Cewu Lu, Wanzeng Kong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: monocular videos remains, severe domain shifts, videos remains challenging, Scene-level neural volumetric, neural volumetric reconstruction

Comments: 19 pages, 5 figures, 4 tables

Abstract:Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with scale-consistent requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multiview scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.

96. 【2603.12655】VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Link: https://arxiv.org/abs/2603.12655

Authors: Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, Yadan Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remain geometrically inconsistent, video frames devote, forecast scene evolution, generating future video, future video frames

Comments:

Abstract:World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
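
The difference between velocity prediction and clean-target (z) prediction can be sketched on a straight-line flow-matching path. The path convention used here (noise at t=0, clean target at t=1) and the function names are assumptions chosen for illustration; the abstract does not state the paper's exact parameterization.

```python
def interpolate(z, noise, t):
    """Point on the straight path from noise (t=0) to the clean target z (t=1)."""
    return [(1.0 - t) * n + t * c for c, n in zip(z, noise)]

def velocity_target(z, noise):
    """Velocity-prediction target: the constant time derivative of the path."""
    return [c - n for c, n in zip(z, noise)]

def velocity_from_clean(z_hat, x_t, t):
    """Velocity implied by a clean-target (z) prediction; valid for t < 1."""
    return [(c - x) / (1.0 - t) for c, x in zip(z_hat, x_t)]

def euler_step(x_t, velocity, dt):
    """One explicit Euler step of the sampling ODE."""
    return [x + dt * v for x, v in zip(x_t, velocity)]

# Toy 2-D latent: with a perfect z prediction, both parameterizations agree.
z = [1.0, -2.0]
noise = [0.5, 0.5]
t = 0.25
x_t = interpolate(z, noise, t)
v_direct = velocity_target(z, noise)
v_implied = velocity_from_clean(z, x_t, t)
x_end = euler_step(x_t, v_implied, 1.0 - t)
```

Under this convention a network trained to output `z_hat` can still drive the same sampler, because the velocity is recovered as `(z_hat - x_t) / (1 - t)`; only the regression target changes.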

97. 【2603.12648】From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Link: https://arxiv.org/abs/2603.12648

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Group Relative Policy, Relative Policy Optimization, Relative Policy, Group Relative, flow models

Comments:

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
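
The contrast between single-view and multi-view advantage estimation can be sketched with the usual GRPO group normalisation. The reward values, the table layout, and the function names below are hypothetical, and the sketch deliberately stops at per-view advantages because the abstract does not say how MV-GRPO combines them in the training objective.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def multi_view_advantages(reward_table):
    """reward_table[v][i] is the reward of sample i scored against condition v.
    Returns one advantage vector per condition (view)."""
    return [group_relative_advantages(view) for view in reward_table]

# Three generated samples scored against the original prompt (view 0)
# and one hypothetical augmented caption (view 1).
reward_table = [
    [1.0, 2.0, 3.0],
    [3.0, 3.0, 0.0],
]
advantages = multi_view_advantages(reward_table)
```

With a single condition only the first row would exist, so sample 2 would always be favoured. The second view reverses that ranking, which illustrates the kind of denser signal the abstract describes.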

98. 【2603.12647】LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

Link: https://arxiv.org/abs/2603.12647

Authors: Ziyu Chen, Fan Zhu, Hui Zhu, Deyi Kong, Xinkai Kuang, Yujia Zhang, Chunmao Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Salient Gaussian Splatting, Gaussian Splatting method, Gaussian Splatting, Salient Gaussian, view synthesis

Comments: 8 pages, 7 figures, conference

Abstract:Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

99. 【2603.12639】RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Link: https://arxiv.org/abs/2603.12639

Authors: Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, Xiu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces fundamental constraints, fundamental constraints due, Scalable Embodied, Embodied World Models, real-world interaction

Comments:

Abstract:Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering 97% average relative improvement on fine-grained manipulation tasks.

100. 【2603.12625】VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

Link: https://arxiv.org/abs/2603.12625

Authors: Ty Valencia, Burak Barlas, Varun Singhal, Ruchir Bhatia, Wei Yang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: commonly framed, signals are combined, Multimodal recommendation, Multimodal, feature fusion problem

Comments: 13 pages, 4 figures, 1 table

Abstract:Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at this https URL.
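
The profile-based matching step mentioned in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes the user profile is the mean of the embeddings of previously interacted items and that candidates are ranked by cosine similarity, neither of which the abstract specifies. The function names (`build_profile`, `cosine`, `rank_items`) and the toy vectors are hypothetical; in practice the embeddings would come from encoding the LVLM-generated item descriptions offline.

```python
import math

def build_profile(history_embeddings):
    """Average a user's historical item embeddings (assumed aggregation rule)."""
    dim = len(history_embeddings[0])
    count = len(history_embeddings)
    return [sum(vec[i] for vec in history_embeddings) / count for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_items(profile, item_embeddings, k):
    """Return the ids of the k catalog items most similar to the profile."""
    ranked = sorted(
        item_embeddings,
        key=lambda item_id: cosine(profile, item_embeddings[item_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for encoded item descriptions (hypothetical data).
catalog = {
    "linen_shirt": [0.9, 0.1, 0.0],
    "wool_coat": [0.1, 0.9, 0.0],
    "linen_pants": [0.8, 0.2, 0.1],
}
history = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]

profile = build_profile(history)
top_items = rank_items(profile, catalog, k=2)
print(top_items)
```

Because the item embeddings do not depend on the user, they can be computed once offline, and only the profile averaging and similarity ranking need to run at request time, which is one way to read the offline-online decomposition the abstract describes.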

101. 【2603.12624】Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

Link: https://arxiv.org/abs/2603.12624

Authors: Guodong Sun, Qihang Liang, Xingyu Pan, Moyun Liu, Yang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: structurally repetitive components, Accurate visual fault, complex operational environments, Accurate visual, transportation system maintenance

Comments: 14 pages, 9 figures

Click to view abstract

Abstract:Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: this https URL

102. 【2603.12606】Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

Link: https://arxiv.org/abs/2603.12606

Authors: Zesheng Yang, Xi Jiang, Bingzhang Hu, Weili Guan, Runmin Cong, Guo-Jun Qi, Feng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Current vision-language detection, ground complex expressions, Current vision-language, struggle to accurately, accurately interpret

Comments: 12 pages, 6 figures

Click to view abstract

Abstract:Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.

103. 【2603.12605】A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering

Link: https://arxiv.org/abs/2603.12605

Authors: Pritham Kumar Jena, Bhavika Baburaj, Tushar Anand, Vedant Dutta, Vineeth Ulavala, Sk Aziz Ali

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simple text prompts, industrial product design, computer-aided design, million ABC CAD, ABC CAD models

Comments: 27 pages, accepted to IEEE/CVF CVPR 2026

Click to view abstract

Abstract:Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which requires nearly 5 terabytes of storage to leverage unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. To this end, we also merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals with our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering. The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.

104. 【2603.12599】A Prediction-as-Perception Framework for 3D Object Detection

Link: https://arxiv.org/abs/2603.12599

Authors: Song Zhang, Haoyu Chen, Ruibo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans combine prediction, Humans combine, observe the world, PAP, perception

Comments:

Click to view abstract

Abstract:Humans combine prediction and perception to observe the world. When faced with rapidly moving birds or insects, we can only perceive them clearly by predicting their next position and focusing our gaze there. Inspired by this, this paper proposes the Prediction-As-Perception (PAP) framework, integrating a prediction-perception architecture into 3D object perception tasks to enhance the model's perceptual accuracy. The PAP framework consists of two main modules: prediction and perception, primarily utilizing continuous frame information as input. Firstly, the prediction module forecasts the potential future positions of ego vehicles and surrounding traffic participants based on the perception results of the current frame. These predicted positions are then passed as queries to the perception module of the subsequent frame. The perceived results are iteratively fed back into the prediction module. We evaluated the PAP structure using the end-to-end model UniAD on the nuScenes dataset. The results demonstrate that the PAP structure improves UniAD's target tracking accuracy by 10% and increases the inference speed by 15%. This indicates that such a biomimetic design significantly enhances the efficiency and accuracy of perception models while reducing computational resource consumption.

105. 【2603.12598】Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating

Link: https://arxiv.org/abs/2603.12598

Authors: Xiangkui Cao, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, shown remarkable potential, Large Vision-Language, finance and healthcare, shown remarkable

Comments:

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing privacy-protection methods have made meaningful progress in preventing the leakage of sensitive data, they are constrained by limitations in both generalization and non-destructiveness. They often struggle to robustly handle unseen privacy-related queries and may inadvertently degrade a model's performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model's privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model's representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model's privacy protection while preserving its original utility.

106. 【2603.12588】SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

Link: https://arxiv.org/abs/2603.12588

Authors: Furui Chen, Han Wang, Yuhan Sun, Jianing You, Yixuan Lv, Zhuang Zhou, Hong Tan, Shengyang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cross-modal ship re-identification, synthetic aperture radar, coherent active radar, active radar sensing, passive optical imaging

Comments:

Click to view abstract

Abstract:Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical-SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at this https URL.

107. 【2603.12587】MRGeo: Robust Cross-View Geo-Localization of Corrupted Images via Spatial and Channel Feature Enhancement

Link: https://arxiv.org/abs/2603.12587

Authors: Le Wu, Lv Bo, Songsong Ouyang, Yingying Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurately localize street-view, Cross-view geo-localization, localize street-view images, geo-tagged satellite images, aims to accurately

Comments:

Click to view abstract

Abstract:Cross-view geo-localization (CVGL) aims to accurately localize street-view images through retrieval of corresponding geo-tagged satellite images. While prior works have achieved nearly perfect performance on certain standard datasets, their robustness in real-world corrupted environments remains under-explored. This oversight causes severe performance degradation or failure when images are affected by corruption such as blur or weather, significantly limiting practical deployment. To address this critical gap, we introduce MRGeo, the first systematic method designed for robust CVGL under corruption. MRGeo employs a hierarchical defense strategy that enhances the intrinsic quality of features and then enforces a robust geometric prior. Its core is the Spatial-Channel Enhancement Block, which contains: (1) a Spatial Adaptive Representation Module that models global and local features in parallel and uses a dynamic gating mechanism to arbitrate their fusion based on feature reliability; and (2) a Channel Calibration Module that performs compensatory adjustments by modeling multi-granularity channel dependencies to counteract information loss. To prevent spatial misalignment under severe corruption, a Region-level Geometric Alignment Module imposes a geometric structure on the final descriptors, ensuring coarse-grained consistency. Comprehensive experiments on both robustness benchmarks and standard datasets demonstrate that MRGeo not only achieves an average R@1 improvement of 2.92% across three comprehensive robustness benchmarks (CVUSA-C-ALL, CVACT_val-C-ALL, and CVACT_test-C-ALL) but also establishes superior performance in cross-area evaluation, thereby demonstrating its robustness and generalization capability.

108. 【2603.12579】DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration

Link: https://arxiv.org/abs/2603.12579

Authors: Youngjin Oh, Junhyeong Kwon, Nam Ik Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ambient light normalization, image understanding capability, ambient light, self-supervised model, light normalization

Comments: Submitted to ICPR 2026 (under review)

Click to view abstract

Abstract:This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2's image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.
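
The layer fusion with a point-wise softmax mask can be pictured with the sketch below. It is an illustration under stated assumptions rather than the paper's module: each layer is reduced to one scalar feature per pixel, and the mask logits are passed in directly, whereas the actual method presumably predicts them with a learned network. `softmax` and `fuse_layers` are hypothetical names.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_features, layer_logits):
    """Fuse per-layer features with a softmax taken across layers at each pixel.

    layer_features[l][p]: feature of layer l at flattened pixel p.
    layer_logits[l][p]:   mask logit for layer l at pixel p (assumed given).
    """
    num_layers = len(layer_features)
    num_pixels = len(layer_features[0])
    fused = []
    for p in range(num_pixels):
        weights = softmax([layer_logits[l][p] for l in range(num_layers)])
        fused.append(sum(w * layer_features[l][p] for l, w in enumerate(weights)))
    return fused

# Two layers, two pixels: equal logits reduce to a plain average per pixel.
features = [[1.0, 2.0], [3.0, 4.0]]
uniform = fuse_layers(features, [[0.0, 0.0], [0.0, 0.0]])
# A strongly positive logit makes pixel 0 rely almost entirely on layer 0.
peaked = fuse_layers(features, [[10.0, 0.0], [-10.0, 0.0]])
print(uniform, peaked)
```

The point of the per-pixel softmax is that different image regions can draw on different layers, for example shallow layers where geometry matters and deeper layers where semantics matter.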

109. 【2603.12577】Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Link: https://arxiv.org/abs/2603.12577

Authors: Jia-Chen Zhang, Zhen-Wei Yan, Yu-Jie Xiong, Chun-Ming Xia

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extreme parameter efficiency, Parameter-Efficient Fine-Tuning, multi-task scenarios due, dominant paradigm, paradigm for deploying

Comments:

Click to view abstract

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture the diverse feature granularities required by distinct tasks, where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions; (2) a Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.

110. 【2603.12575】AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

Link: https://arxiv.org/abs/2603.12575

Authors: Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, Weidong Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, backbone for high-fidelity, generation due, dominant backbone, due to strong

Comments: 32 pages, 13 tables, 12 figures

Click to view abstract

Abstract:Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11× speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at this https URL.

111. 【2603.12557】Lyapunov Stable Graph Neural Flow

Link: https://arxiv.org/abs/2603.12557

Authors: Haoyu Chu, Xiaotong Chen, Wei Zhou, Wenjun Cui, Kai Zhao, Shikui Wei, Qiyu Kang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: topology and features, making the learning, critical challenge, highly vulnerable, learning of robust

Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) are highly vulnerable to adversarial perturbations in both topology and features, making the learning of robust representations a critical challenge. In this work, we bridge GNNs with control theory to introduce a novel defense framework grounded in integer- and fractional-order Lyapunov stability. Unlike conventional strategies that rely on resource-heavy adversarial training or data purification, our approach fundamentally constrains the underlying feature-update dynamics of the GNN. We propose an adaptive, learnable Lyapunov function paired with a novel projection mechanism that maps the network's state into a stable space, thereby offering theoretically provable stability guarantees. Notably, this mechanism is orthogonal to existing defenses, allowing for seamless integration with techniques like adversarial training to achieve cumulative robustness. Extensive experiments demonstrate that our Lyapunov-stable graph neural flows substantially outperform base neural flows and state-of-the-art baselines across standard benchmarks and various adversarial attack scenarios.

112. 【2603.12553】Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

Link: https://arxiv.org/abs/2603.12553

Authors: Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: improved robotic manipulation, architectures have improved, predictive visual foresight, improved robotic, robotic manipulation

Comments:

Click to view abstract

Abstract:Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
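
One way to picture how sparse structured frames could be derived from kinematic cues is sketched below. This is an assumption-heavy illustration, not StructVLA's procedure: it treats a gripper open/close change as a transition and a change of motion direction above a threshold angle as a kinematic turning point. The function name `structured_frame_indices` and the 60-degree default are hypothetical.

```python
import math

def structured_frame_indices(positions, gripper_open, angle_threshold_deg=60.0):
    """Pick sparse milestone frames from an end-effector trajectory.

    positions:    list of coordinate tuples, one per frame.
    gripper_open: list of booleans, one per frame.
    A frame is kept if the gripper state changes at it, or if the direction of
    motion turns by more than angle_threshold_deg there (assumed criteria).
    """
    keep = set()
    # Gripper transitions: the state differs from the previous frame.
    for t in range(1, len(positions)):
        if gripper_open[t] != gripper_open[t - 1]:
            keep.add(t)
    # Turning points: large angle between incoming and outgoing displacement.
    for t in range(1, len(positions) - 1):
        before = [b - a for a, b in zip(positions[t - 1], positions[t])]
        after = [b - a for a, b in zip(positions[t], positions[t + 1])]
        norm_before = math.sqrt(sum(x * x for x in before))
        norm_after = math.sqrt(sum(x * x for x in after))
        if norm_before == 0.0 or norm_after == 0.0:
            continue  # stationary segment, direction undefined
        cos_angle = sum(x * y for x, y in zip(before, after)) / (norm_before * norm_after)
        cos_angle = max(-1.0, min(1.0, cos_angle))
        if math.degrees(math.acos(cos_angle)) > angle_threshold_deg:
            keep.add(t)
    return sorted(keep)

# Move right, turn upward at frame 2, close the gripper at frame 3.
trajectory = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
gripper = [True, True, True, False, False]
milestones = structured_frame_indices(trajectory, gripper)
print(milestones)
```

Frames selected this way are tied to physical events in the trajectory, which is the kind of kinematic grounding the abstract contrasts with dense rollouts and purely semantic subgoals.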

113. 【2603.12551】CVGL: Causal Learning and Geometric Topology

Link: https://arxiv.org/abs/2603.12551

Authors: Songsong Ouyang, Yingying Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to estimate, aerial image, Geometric Topology Fusion, estimate the geographic, geographic location

Comments:

Click to view abstract

Abstract:Cross-view geo-localization (CVGL) aims to estimate the geographic location of a street image by matching it with a corresponding aerial image. This is critical for autonomous navigation and mapping in complex real-world scenarios. However, the task remains challenging due to significant viewpoint differences and the influence of confounding factors. To tackle these issues, we propose the Causal Learning and Geometric Topology (CLGT) framework, which integrates two key components: a Causal Feature Extractor (CFE) that mitigates the influence of confounding factors by leveraging causal intervention to encourage the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird's Eye View (BEV) road topology into street features to alleviate cross-view inconsistencies caused by extreme perspective changes. Additionally, we introduce a Data-Adaptive Pooling (DA Pooling) module to enhance the representation of semantically rich regions. Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions. Our codes are available at this https URL.

114. 【2603.12547】Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

Link: https://arxiv.org/abs/2603.12547

Authors: Fares Bougourzi, Fadi Dornaika, Abdenour Hadid

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, reaching expert-level accuracy, tumors and tissues, medical image segmentation, learning has achieved

Comments:

Click to view abstract

Abstract:Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.

115. 【2603.12545】Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

Link: https://arxiv.org/abs/2603.12545

Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A., Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: basic spatial reasoning, advanced rapidly, Vision-language models, struggle with basic, Vision-language

Comments: Accepted as a poster at ICLR 2026 workshop ICBINB

Click to view abstract

Abstract:Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.

116. 【2603.12538】Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

Link: https://arxiv.org/abs/2603.12538

Authors: Alaa Dalaq, Muzammil Behzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Referring image segmentation, aims to produce, produce a pixel-level, pixel-level mask, Referring image

Comments:

Click to view abstract

Abstract:Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
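
The tuning strategy that updates only normalization and bias terms can be sketched as a simple parameter filter. The example below is a toy under stated assumptions: parameters are identified by name, normalization layers contain `norm` in their names, and biases end in `.bias`. These naming conventions, the function `select_trainable`, and the single-block parameter table are hypothetical, and the resulting fraction is not the paper's figure.

```python
def select_trainable(param_sizes):
    """Return names of normalization and bias parameters (assumed naming scheme)."""
    return [name for name in param_sizes if "norm" in name or name.endswith(".bias")]

# Parameter counts for one hypothetical Transformer block with width 768.
backbone = {
    "blocks.0.attn.qkv.weight": 3 * 768 * 768,
    "blocks.0.attn.qkv.bias": 3 * 768,
    "blocks.0.norm1.weight": 768,
    "blocks.0.norm1.bias": 768,
    "blocks.0.mlp.fc1.weight": 768 * 3072,
    "blocks.0.mlp.fc1.bias": 3072,
}

trainable = select_trainable(backbone)
fraction = sum(backbone[name] for name in trainable) / sum(backbone.values())
print(trainable)
print(f"trainable fraction: {fraction:.4%}")
```

Weight matrices dominate the parameter count, so leaving them frozen keeps the trainable share very small, in line with the sub-1% budget the abstract reports.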

117. 【2603.12533】Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Link: https://arxiv.org/abs/2603.12533

Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, user pointing gesture, current Multimodal Large, Large Language Models, answering questions based

Comments: Accepted to CVPR 2026

Abstract:Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art InternVL3-14B by 6.6%. To further facilitate open research, we will release the code, model, and dataset. Project page: this https URL

118. 【2603.12517】Curriculum Sampling: A Two-Phase Curriculum for Efficient Training of Flow Matching

Link: https://arxiv.org/abs/2603.12517

Authors: Pengwei Sun

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Flow Matching models, Flow Matching, common practice increasingly, practice increasingly favors, increasingly favors static

Comments: none

Abstract:Timestep sampling $p(t)$ is a central design choice in Flow Matching models, yet common practice increasingly favors static middle-biased distributions (e.g., Logit-Normal). We show that this choice induces a speed-quality trade-off: middle-biased sampling accelerates early convergence but yields worse asymptotic fidelity than Uniform sampling. By analyzing per-timestep training losses, we identify a U-shaped difficulty profile with persistent errors near the boundary regimes, implying that under-sampling the endpoints leaves fine details unresolved. Guided by this insight, we propose Curriculum Sampling, a two-phase schedule that begins with middle-biased sampling for rapid structure learning and then switches to Uniform sampling for boundary refinement. On CIFAR-10, Curriculum Sampling improves the best FID from 3.85 (Uniform) to 3.22 while reaching peak performance at 100k rather than 150k training steps. Our results highlight that timestep sampling should be treated as an evolving curriculum rather than a fixed hyperparameter.
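
The two-phase schedule described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's code: the function names, the Logit-Normal parameters (`mean=0.0`, `std=1.0`), and the switch point `switch_step` are hypothetical, since the abstract does not give them.

```python
import math
import random


def sample_timestep(step, switch_step, rng, mean=0.0, std=1.0):
    """Draw one training timestep t for the current optimizer step.

    Phase 1 (step < switch_step): Logit-Normal, i.e. the sigmoid of a
    Gaussian draw, which concentrates samples around the middle of (0, 1).
    Phase 2 (step >= switch_step): Uniform on [0, 1), which visits the
    endpoints as often as the middle.
    """
    if step < switch_step:
        z = rng.gauss(mean, std)
        return 1.0 / (1.0 + math.exp(-z))
    return rng.random()


def boundary_fraction(samples, margin=0.1):
    """Fraction of samples that fall within `margin` of either endpoint."""
    near = sum(1 for t in samples if t < margin or t > 1.0 - margin)
    return near / len(samples)


rng = random.Random(0)
switch_step = 50_000  # hypothetical value; the abstract does not state it
phase1 = [sample_timestep(0, switch_step, rng) for _ in range(20_000)]
phase2 = [sample_timestep(switch_step, switch_step, rng) for _ in range(20_000)]

print("near-boundary share, phase 1:", boundary_fraction(phase1))
print("near-boundary share, phase 2:", boundary_fraction(phase2))
```

Comparing the two printed shares shows the effect the abstract attributes to middle-biased sampling: the endpoints are visited far less often in phase 1 than in phase 2.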

119. 【2603.12514】Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

Link: https://arxiv.org/abs/2603.12514

Authors: Shivam Chaudhary, Sheethal Bhat, Andreas Maier

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: annotated medical data, Masked Image Modeling, emergency radiology, primarily due, Accurate detection

Comments: 9 pages, 6 figures, 6 tables. The code is available at [this https URL](https://github.com/shivasmic/3d-trauma-detection-ssl)

Abstract:Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label-efficient approach combining self-supervised pre-training with semi-supervised detection for 3D medical image analysis. We employ patch-based Masked Image Modeling (MIM) to pre-train a 3D U-Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 with only 144 labeled training samples, representing a 115% improvement over supervised-only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self-supervised features. Our results validate that self-supervised pre-training combined with semi-supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.

120. 【2603.12513】MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

Link: https://arxiv.org/abs/2603.12513

Authors: Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, Peter A. Beerel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autoregressive diffusion enables, real-time frame streaming, diffusion enables real-time, enables real-time frame, Autoregressive diffusion

Comments: 9 pages main, 3 pages references, 6 pages appendix. Project page: [this https URL](https://memrope.github.io)

Abstract:Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
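
A rough sketch of the two mechanisms named in this abstract: exponential-moving-average memory over unrotated keys, with the rotary position embedding applied only at attention time. Everything below is an assumption made for illustration (the class name `EvolvingMemory`, the decay values, and the positions assigned to the memory tokens); the abstract does not specify these details.

```python
import math


def ema_update(memory, key, decay):
    """Element-wise EMA: memory <- decay * memory + (1 - decay) * key."""
    return [decay * m + (1.0 - decay) * k for m, k in zip(memory, key)]


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary position embedding to an even-length vector."""
    out = list(vec)
    dim = len(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


class EvolvingMemory:
    """Two fixed-size memory tokens built from unrotated keys (hypothetical)."""

    def __init__(self, long_decay=0.99, short_decay=0.5):
        self.long_decay = long_decay
        self.short_decay = short_decay
        self.long_term = None
        self.short_term = None

    def absorb(self, unrotated_key):
        """Fold a key leaving the sliding window into both streams."""
        if self.long_term is None:
            self.long_term = list(unrotated_key)
            self.short_term = list(unrotated_key)
            return
        self.long_term = ema_update(self.long_term, unrotated_key, self.long_decay)
        self.short_term = ema_update(self.short_term, unrotated_key, self.short_decay)

    def keys_for_attention(self, long_pos, short_pos):
        """Rotate the aggregated keys only now, at attention time."""
        return [rope_rotate(self.long_term, long_pos),
                rope_rotate(self.short_term, short_pos)]


memory = EvolvingMemory()
for _ in range(10):
    memory.absorb([1.0, 0.0, 0.0, 0.0])   # early content
for _ in range(3):
    memory.absorb([0.0, 1.0, 0.0, 0.0])   # recent content
long_key, short_key = memory.keys_for_attention(long_pos=0, short_pos=7)
```

Because the averages are taken before any rotation, keys from different positions are never mixed with conflicting phases, and the cache stays at two tokens regardless of how many keys have been absorbed. The short-term stream tracks the recent content much faster than the long-term stream.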

121. 【2603.12506】Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

Link: https://arxiv.org/abs/2603.12506

Authors: Joong Ho Kim, Nicholas Thai, Souhardya Saha Dip, Dong Lao, Keith G. Mills

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: random Gaussian noise, random Gaussian, Naïve PAINE, Diffusion Models, Gaussian noise

Comments: Code available at [this https URL](https://github.com/LSU-ATHENA/Naive-PAINE)

Abstract:Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.

122. 【2603.12493】RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

Link: https://arxiv.org/abs/2603.12493

Authors: Ali Mosleh, Faraz Ali, Fengjia Zhang, Stavros Tsogkas, Junyong Lee, Alex Levinshtein, Michael S. Brown

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: obtaining sensor-specific training, Digital zoom, sensor-specific training data, learning-based super-resolution, RAW sensor images

Comments: This paper has been accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.

123. 【2603.12482】CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

Link: https://arxiv.org/abs/2603.12482

Authors: Tianshuo Xu, Tiantian Hong, Zhifei Chen, Fei Chao, Ying-cong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires balancing glyph, balancing glyph precision, calligraphy synthesis requires, synthesis requires balancing, Page-level calligraphy synthesis

Comments: none

Abstract:Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline (Text → Layout → Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.

124. 【2603.12478】Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

Link: https://arxiv.org/abs/2603.12478

Authors: Rujie Wu, Haozhe Zhao, Hai Ci, Yizhou Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multimodal instruction tuning, large mixed image-video, Multimodal instruction, mixed image-video pools, highly uneven

Comments: none

Abstract:Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1$\times$ training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving Accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at this https URL.

125. 【2603.12469】Unleashing Video Language Models for Fine-grained HRCT Report Generation

Link: https://arxiv.org/abs/2603.12469

Authors: Yingying Fang, Huichi Zhou, KinHei Lee, Yijia Wang, Zhenxuan Zhang, Jiahao Huang, Guang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High-Resolution Computed Tomography, Computed Tomography, Generating precise diagnostic, formidable challenge due, high pathological diversity

Comments: MICCAI 2026

Abstract:Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at this https URL

126. 【2603.12468】Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

Link: https://arxiv.org/abs/2603.12468

Authors: Alexis Guichemerre, Banafsheh Karimian, Soufiane Belharbi, Natacha Gillet, Nicolas Thome, Pourya Shamsolmoali, Mohammadhadi Shateri, Luke McCaffrey, Eric Granger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weakly Supervised Object, Supervised Object Localization, Weakly Supervised, Supervised Object, enable joint classification

Comments: 10 pages, 4 figures

Abstract:Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When these models are deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when they are applied to new organs or to institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high-entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and cross-center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. Code: this https URL
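
The selection step described in this abstract (find over-predicted classes, then pick their uncertain images) might look roughly like the sketch below. The criteria are assumptions for illustration only: the abstract does not say how "over-predicted" is decided or what entropy threshold is used, so `expected_share` and `entropy_threshold` are hypothetical parameters, and the confidence-reduction update itself is not shown.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_debiasing(predictions, expected_share, entropy_threshold):
    """Return indices of images whose confidence should be reduced.

    An image is selected when (a) its pseudo-label belongs to a class that
    receives more than `expected_share` of all pseudo-labels, and (b) its
    prediction entropy exceeds `entropy_threshold`. Confident predictions
    are left untouched.
    """
    labels = [max(range(len(p)), key=p.__getitem__) for p in predictions]
    counts = Counter(labels)
    over_predicted = {c for c, n in counts.items()
                      if n / len(predictions) > expected_share}
    return [i for i, (p, c) in enumerate(zip(predictions, labels))
            if c in over_predicted and entropy(p) > entropy_threshold]


# Toy target-domain predictions for a two-class problem.
predictions = [
    [0.90, 0.10],  # class 0, confident
    [0.55, 0.45],  # class 0, uncertain
    [0.60, 0.40],  # class 0, uncertain
    [0.20, 0.80],  # class 1
]
selected = select_for_debiasing(predictions, expected_share=0.5,
                                entropy_threshold=0.6)
print(selected)
```

In this toy batch, class 0 receives three of the four pseudo-labels, so only its two high-entropy images are flagged while the confident class-0 prediction is preserved.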

127. 【2603.12459】Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group

Link: https://arxiv.org/abs/2603.12459

Authors: Alan Garbarz

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: steerable equivariant convolutional, convolutional neural networks, equivariant convolutional neural, steerable kernel constraint, steerable equivariant

Comments: 28 pages. Comments are welcome

Abstract:We present an alternative way of solving the steerable kernel constraint that appears in the design of steerable equivariant convolutional neural networks. We find explicit real and complex bases which are ready to use, for different symmetry groups and for feature maps of arbitrary tensor type. A major advantage of this method is that it bypasses the need to numerically or analytically compute Clebsch-Gordan coefficients and works directly with the representations of the input and output feature maps. The strategy is to find a basis of kernels that respect a simpler invariance condition at some point $x_0$, and then steer it with the defining equation of steerability to move to some arbitrary point $x=g\cdot x_0$. This idea has been mentioned in the literature, but it has not been developed in depth or in generality. Here we describe how it works with minimal technical tools to make it accessible to a general audience.

128. 【2603.12433】Revisiting Model Stitching In the Foundation Model Era

Link: https://arxiv.org/abs/2603.12433

Authors: Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: stitch, representational compatibility, stitch layer, light stitch layer, Vision Foundation Models

Comments: Accepted by CVPR 2026

Abstract:Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.

129. 【2603.12430】Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

Link: https://arxiv.org/abs/2603.12430

Authors: Jian Jiang, Chenxi Lin, Yiming Gu, Zengyi Qin, Zhitao Zeng, Kun Yuan, Yonghao Long, Xiang Xia, Cheng Yuan, Yuqi Wang, Zijie Yue, Kunyi Yang, Yuting Zhang, Zhu Zhuo, Dian Qin, Xin Wang, NG Chi Fai, Brian Anthony, Daguang Xu, Guy Rosman, Ozanan Meireles, Zizhen Zhang, Nicolas Padoy, Hesheng Wang, Qi Dou, Yueming Jin, Yutong Ban

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scene understanding demands, clinical expertise, surgeons can verify, verify against clinical, Surgical scene understanding

Comments: none

Abstract:Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.

130. 【2603.12421】A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning

Link: https://arxiv.org/abs/2603.12421

Authors: Hongyan Wei, Wael AbdAlmageed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: purely data-driven inductive, models rely heavily, data-driven inductive reasoning, autonomous driving models, driving models rely

Comments: Under review. 16 pages, 2 figures

Abstract:Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning. This "black-box" nature leads to a lack of interpretability and absolute safety guarantees in complex, long-tail scenarios. To overcome this bottleneck, we propose a novel neuro-symbolic trajectory planning framework that seamlessly integrates rigorous deductive reasoning into end-to-end neural networks. Specifically, our framework utilizes a Large Language Model (LLM) to dynamically extract scene rules and employs an Answer Set Programming (ASP) solver for deterministic logical arbitration, generating safe and traceable discrete driving decisions. To bridge the gap between discrete symbols and continuous trajectories, we introduce a decision-conditioned decoding mechanism that transforms high-level logical decisions into learnable embedding vectors, simultaneously constraining the planning query and the physical initial velocity of a differentiable Kinematic Bicycle Model (KBM). By combining KBM-generated physical baseline trajectories with neural residual corrections, our approach inherently guarantees kinematic feasibility while ensuring a high degree of transparency. On the nuScenes benchmark, our method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and optimizing trajectory prediction consistency (TPC) to 0.47 m.
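
The idea of a physical baseline trajectory plus a learned residual can be illustrated with the textbook kinematic bicycle model. This is a sketch under assumptions, not the paper's implementation: the rear-axle formulation, Euler integration, wheelbase, step size, and function names are all chosen here for illustration, and the residuals are plain numbers standing in for a network's output.

```python
import math


def kbm_rollout(x, y, yaw, speed, controls, wheelbase=2.7, dt=0.5):
    """Euler-integrate a rear-axle kinematic bicycle model.

    `controls` is a list of (acceleration, steering_angle) pairs, one per
    step. `speed` is the initial velocity, which the abstract says is
    conditioned on the discrete driving decision. Returns the (x, y)
    waypoint reached after each step.
    """
    waypoints = []
    for accel, steer in controls:
        x += speed * math.cos(yaw) * dt
        y += speed * math.sin(yaw) * dt
        yaw += speed / wheelbase * math.tan(steer) * dt
        speed += accel * dt
        waypoints.append((x, y))
    return waypoints


def add_residuals(baseline, residuals):
    """Final plan = physical baseline + per-waypoint correction."""
    return [(bx + rx, by + ry)
            for (bx, by), (rx, ry) in zip(baseline, residuals)]


# Six steps of 0.5 s each at a constant 5 m/s.
straight = kbm_rollout(0.0, 0.0, 0.0, 5.0, [(0.0, 0.0)] * 6)
left_turn = kbm_rollout(0.0, 0.0, 0.0, 5.0, [(0.0, 0.1)] * 6)

# Hypothetical small corrections standing in for a neural residual head.
corrected = add_residuals(straight, [(0.0, 0.05)] * 6)
```

Starting from a model-generated baseline means the residual only has to express small deviations, which is the structure the abstract describes for combining KBM trajectories with neural corrections.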

131. 【2603.12409】ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2603.12409

Authors: Mattia Bernardi, Chiara Cappellino, Matteo Mosconi, Enver Sangineto, Angelo Porrello, Simone Calderara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent Open-Vocabulary Object, Object Detection architectures, Open-Vocabulary Object Detection, Grounding DINO, strong zero-shot capabilities

Comments: none

Abstract:Although recent Open-Vocabulary Object Detection architectures, such as Grounding DINO, demonstrate strong zero-shot capabilities, their performance degrades significantly under domain shifts. Moreover, many domains of practical interest, such as nighttime or foggy scenes, lack large annotated datasets, preventing direct fine-tuning. In this paper, we introduce Aligned Basis Relocation for Adaptation (ABRA), a method that transfers class-specific detection knowledge from a labeled source domain to a target domain where no training images containing these classes are accessible. ABRA formulates this adaptation as a geometric transport problem in the weight space of a pretrained detector, aligning source and target domain experts to transport class-specific knowledge. Extensive experiments across challenging domain shifts demonstrate that ABRA successfully teleports class-level specialization under multiple adverse conditions. Our code will be made public upon acceptance.

132. 【2603.12388】Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

Link: https://arxiv.org/abs/2603.12388

Authors: Chenkai Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Practical webcam gaze, webcam gaze tracking, Practical webcam, tracking is constrained, deg

Comments: 24 pages, 7 figures. Deployment-oriented landmark-only webcam gaze tracking with browser-capable runtime

Abstract:Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
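
For reference, a closed-form ridge calibrator of the kind mentioned in this abstract has the standard ridge-regression solution below. The notation is an assumption, not taken from the paper: $Z \in \mathbb{R}^{n \times d}$ stacks the encoder embeddings of the $n$ calibration samples, $Y \in \mathbb{R}^{n \times 2}$ holds the corresponding on-screen targets, $\lambda > 0$ is the regularization strength, and any bias term is omitted for brevity.

```latex
W^{\star}
  = \arg\min_{W} \; \lVert Z W - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2}
  = \left( Z^{\top} Z + \lambda I_d \right)^{-1} Z^{\top} Y,
\qquad
\hat{y}(z) = W^{\star\top} z
```

Because $W^{\star}$ is a differentiable function of $Z$, gradients can pass through the calibration step into the encoder, which is consistent with the abstract's description of a calibrator "differentiated through episodic meta-training".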

133. 【2603.12382】SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Link: https://arxiv.org/abs/2603.12382

Authors: Mohamad Alansari, Naufal Suryanto, Divya Velayudhan, Sajid Javed, Naoufel Werghi, Muzammal Naseer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, large language models, Multimodal large, videos remains challenging, consistent reference tracking

Comments: Accepted at CVPR 2026; Project page: [this https URL](https://risys-lab.github.io/SPARROW); Repository: [this https URL](https://github.com/RISys-Lab/SPARROW)

Abstract:Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 QA pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 JF on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: this https URL

134. 【2603.12369】Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

Link: https://arxiv.org/abs/2603.12369

Authors: Ayan Banerjee, Kuntal Thakur, Sandeep Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalizing image classification, image-based diabetic retinopathy, seizure onset zone, fundus image-based diabetic, resting-state fMRI seizure

Comments: none

Abstract:Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Models (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.
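To make the LoRA mention above a little more concrete, here is a minimal sketch of why low-rank adaptation is cheap to train. It only counts parameters for one weight matrix, using the standard LoRA factorization (the update is a product of a `d_out x rank` matrix and a `rank x d_in` matrix). The layer size and rank below are hypothetical illustration values, not settings taken from the paper.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
fraction = lora / full

print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank {rank}): {lora} parameters ({fraction:.2%} of full)")
```

Under these assumed sizes, the adapter trains well under one percent of the parameters of the matrix it modifies.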

135. 【2603.12354】Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

Link: https://arxiv.org/abs/2603.12354

Authors: Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Hanjie Liu, Leszek Rutkowski

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: Efficient deep learning, learning traditionally relies, deep learning traditionally, Efficient deep, activation awareness

Comments: 11 pages, 6 figures, 9 tables

Abstract:Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

136. 【2603.12310】VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Link: https://arxiv.org/abs/2603.12310

Authors: Yiwen Song, Tomas Pfister, Yale Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: intent remains challenging, complex user intent, user intent remains, aligning their outputs, remains challenging

Comments: none

Abstract:Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

137. 【2603.13162】DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Link: https://arxiv.org/abs/2603.13162

Authors: Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently shown outstanding, prohibitive sampling overhead, outstanding perceptual fidelity, shown outstanding perceptual, recently shown

Abstract:Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
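The downscaling factors quoted above translate directly into latent grid sizes. The sketch below works through that arithmetic for the 2048x2048 case mentioned in the abstract. It assumes one latent position per downscaled pixel and ignores channel counts and any patching inside the transformer, so it is a rough size comparison rather than a cost model of DiT-IC.

```python
def latent_grid(height: int, width: int, downscale: int) -> tuple[int, int]:
    """Spatial size of a latent map after uniform downscaling."""
    if height % downscale or width % downscale:
        raise ValueError("image size must be divisible by the downscale factor")
    return height // downscale, width // downscale


def latent_positions(height: int, width: int, downscale: int) -> int:
    """Number of spatial positions in the latent map."""
    h, w = latent_grid(height, width, downscale)
    return h * w


# 8x is the shallow latent typical of U-Net diffusion codecs (per the abstract);
# 32x is the resolution at which DiT-IC is said to run diffusion.
shallow = latent_positions(2048, 2048, 8)
deep = latent_positions(2048, 2048, 32)
ratio = shallow // deep

print(latent_grid(2048, 2048, 8), shallow)
print(latent_grid(2048, 2048, 32), deep)
print(f"the 32x latent has {ratio}x fewer spatial positions")
```

Going from 8x to 32x downscaling shrinks each spatial side by 4, so the number of latent positions drops by a factor of 16.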

138. 【2603.13007】Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning

Link: https://arxiv.org/abs/2603.13007

Authors: Yamin Arefeen, Sidharth Kumar, Steven Warach, Hamidreza Saber, Jonathan Tamir

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Keywords: Probabilistic Generative Models, Diffusion Probabilistic Generative, Probabilistic Generative, clinical stroke MRI, stroke MRI

Comments: none

Abstract:Purpose: To develop a data-efficient strategy for accelerated MRI reconstruction with Diffusion Probabilistic Generative Models (DPMs) that enables faster scan times in clinical stroke MRI when only limited fully-sampled data samples are available. Methods: Our simple training strategy, inspired by the foundation model paradigm, first trains a DPM on a large, diverse collection of publicly available brain MRI data in fastMRI and then fine-tunes on a small dataset from the target application using carefully selected learning rates and fine-tuning durations. The approach is evaluated on controlled fastMRI experiments and on clinical stroke MRI data with a blinded clinical reader study. Results: DPMs pre-trained on approximately 4000 subjects with non-FLAIR contrasts and fine-tuned on FLAIR data from only 20 target subjects achieve reconstruction performance comparable to models trained with substantially more target-domain FLAIR data across multiple acceleration factors. Experiments reveal that moderate fine-tuning with a reduced learning rate yields improved performance, while insufficient or excessive fine-tuning degrades reconstruction quality. When applied to clinical stroke MRI, a blinded reader study involving two neuroradiologists indicates that images reconstructed using the proposed approach from $2 \times$ accelerated data are non-inferior to standard-of-care in terms of image quality and structural delineation. Conclusion: Large-scale pre-training combined with targeted fine-tuning enables DPM-based MRI reconstruction in data-constrained, accelerated clinical stroke MRI. The proposed approach substantially reduces the need for large application-specific datasets while maintaining clinically acceptable image quality, supporting the use of foundation-inspired diffusion models for accelerated MRI in targeted applications.

139. 【2603.12951】Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration

Link: https://arxiv.org/abs/2603.12951

Authors: Riccardo Raciti, Lemuel Puglisi, Francesco Guarnera, Daniele Ravì, Sebastiano Battiato

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, Percentage Brain Volume, Brain Volume Change, Volume Change, Resonance Imaging

Comments: none

Abstract:Percentage Brain Volume Change (PBVC) derived from Magnetic Resonance Imaging (MRI) is a widely used biomarker of brain atrophy, with SIENA among the most established methods for its estimation. However, SIENA relies on classical image processing steps, particularly skull stripping and tissue segmentation, whose failures can propagate through the pipeline and bias atrophy estimates. In this work, we examine whether targeted deep learning substitutions can improve SIENA while preserving its established and interpretable framework. To this end, we integrate SynthStrip and SynthSeg into SIENA and evaluate three pipeline variants on the ADNI and PPMI longitudinal cohorts. Performance is assessed using three complementary criteria: correlation with longitudinal clinical and structural decline, scan-order consistency, and end-to-end runtime. Replacing the skull-stripping module yields the most consistent gains: in ADNI, it substantially strengthens associations between PBVC and multiple measures of disease progression relative to the standard SIENA pipeline, while across both datasets it markedly improves robustness under scan reversal. The fully integrated pipeline achieves the strongest scan-order consistency, reducing the error by up to 99.1%. In addition, GPU-enabled variants reduce execution time by up to 46% while maintaining CPU runtimes comparable to standard SIENA. Overall, these findings show that deep learning can meaningfully strengthen established longitudinal atrophy pipelines when used to reinforce their weakest image processing steps. More broadly, this study highlights the value of modularly modernizing clinically trusted neuroimaging tools without sacrificing their interpretability. Code is publicly available at this https URL.

140. 【2603.12800】GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

Link: https://arxiv.org/abs/2603.12800

Authors: Jiao Wang, Chi Liu, Yiying Zhang, Hongchen Luo, Zhifen Guo, Ying Hu, Ke Xu, Jing Zhou, Hongyan Xu, Ruiting Zhou, Man Tang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: circumpapillary OCT images, ophthalmoscopy fundus images, pattern deviation maps, enabling effective exploitation, dataset comprising scanning

Comments: none

Abstract:We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.

141. 【2603.12715】Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging

Link: https://arxiv.org/abs/2603.12715

Authors: Muhammad Ahmed Khan, Manqiang Peng, Ding Lin, Saif Ur Rehman Khan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: conventional blood-based testing, Regular monitoring, conventional blood-based, blood-based testing, burdensome for frequent

Comments: none

Abstract:Regular monitoring of glycemic status is essential for diabetes management, yet conventional blood-based testing can be burdensome for frequent assessment. The sclera contains superficial microvasculature that may exhibit diabetes-related alterations and is readily visible on the ocular surface. We propose ScleraGluNet, a multiview deep-learning framework for three-class metabolic status classification (normal, controlled diabetes, and high-glucose diabetes) and continuous fasting plasma glucose (FPG) estimation from multidirectional scleral vessel images. The dataset comprised 445 participants (150/140/155) and 2,225 anterior-segment images acquired from five gaze directions per participant. After vascular enhancement, features were extracted using parallel convolutional branches, refined with Manta Ray Foraging Optimization (MRFO), and fused via transformer-based cross-view attention. Performance was evaluated using subject-wise five-fold cross-validation, with all images from each participant assigned to the same fold. ScleraGluNet achieved 93.8% overall accuracy, with one-vs-rest AUCs of 0.971, 0.956, and 0.982 for normal, controlled diabetes, and high-glucose diabetes, respectively. For FPG estimation, the model achieved MAE = 6.42 mg/dL and RMSE = 7.91 mg/dL, with strong correlation to laboratory measurements (r = 0.983; $R^2$ = 0.966). Bland-Altman analysis showed a mean bias of +1.45 mg/dL with 95% limits of agreement from -8.33 to +11.23 mg/dL. These results support multidirectional scleral vessel imaging with multiview learning as a promising noninvasive approach for glycemic assessment, warranting multicenter validation before clinical deployment.
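For readers unfamiliar with the agreement analysis quoted above, the following sketch shows how a mean bias and 95% limits of agreement are conventionally computed (mean difference plus or minus 1.96 sample standard deviations). The toy readings are made up for illustration. The last lines are a sanity check on the reported numbers only: assuming the usual 1.96 multiplier, they back out the standard deviation of the differences implied by the quoted limits.

```python
from statistics import mean, stdev


def bland_altman(predicted, reference):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Toy paired readings in mg/dL (illustrative values, not data from the paper).
toy_predicted = [101.0, 99.0, 103.0, 98.0]
toy_reference = [100.0, 100.0, 100.0, 100.0]
bias, lower, upper = bland_altman(toy_predicted, toy_reference)
print(f"toy bias {bias:+.2f}, limits {lower:+.2f} to {upper:+.2f}")

# Consistency check on the figures quoted in the abstract.
reported_lower, reported_upper = -8.33, 11.23
implied_center = (reported_lower + reported_upper) / 2
implied_sd = (reported_upper - reported_lower) / (2 * 1.96)
print(f"implied center {implied_center:+.2f} mg/dL, implied SD {implied_sd:.2f} mg/dL")
```

The midpoint of the reported limits matches the reported bias of +1.45 mg/dL, and the implied standard deviation of the differences is about 5 mg/dL.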

142. 【2603.12581】Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

Link: https://arxiv.org/abs/2603.12581

Authors: Jianqiang Lin (1 and 2), Zhiqiang Shen (1 and 2), Peng Cao (1, 2 and 3), Jinzhu Yang (1, 2 and 3), Osmar R. Zaiane (4), Xiaoli Liu (5) ((1) Northeastern University, Shenyang, China, (2) Key Laboratory of Intelligent Computing in Medical Image, Shenyang, China, (3) National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang, China, (4) University of Alberta, Edmonton, Canada, (5) AiShiWeiLai AI Research, Beijing, China)

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, arbitrary missing-modality scenarios, achieved remarkable progress, handling arbitrary missing-modality, multi-modal magnetic resonance

Comments: none

Abstract:Although diffusion models have achieved remarkable progress in multi-modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios. To address these issues, we propose a latent diffusion-based multi-modal MRI translation framework, termed MSG-LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style-structure disentanglement mechanism in the latent space, which explicitly separates modality-specific style features from shared structural representations, and jointly models low-frequency anatomical layouts and high-frequency boundary details in a multi-scale feature space. During the structure disentanglement stage, high-frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine-grained structural cues while learning modality-invariant low-frequency anatomical representations. Furthermore, to reduce interference from modality-specific styles and improve the stability of structure representations, we design a style consistency loss and a structure-aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at this https URL.

143. 【2603.12562】Variational Garrote for Sparse Inverse Problems

Link: https://arxiv.org/abs/2603.12562

Authors: Kanghun Lee, Hyungjoon Soh, Junghyo Jo

Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: solving inverse problems, inverse problems arising, corrupted measurements, plays a central, central role

Comments: 10 pages, 4 figures

Abstract:Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers correspond to different prior assumptions about the structure of the unknown signal, and reconstruction performance depends on how well these priors match the intrinsic sparsity of the data. This work investigates the effect of sparsity priors in inverse problems by comparing conventional L1 regularization with the Variational Garrote (VG), a probabilistic method that approximates L0 sparsity through variational binary gating variables. A unified experimental framework is constructed across multiple reconstruction tasks including signal resampling, signal denoising, and sparse-view computed tomography. To enable consistent comparison across models with different parameterizations, regularization strength is swept across wide ranges and reconstruction behavior is analyzed through train-generalization error curves. Experiments reveal characteristic bias-variance tradeoff patterns across tasks and demonstrate that VG frequently achieves lower minimum generalization error and improved stability in strongly underdetermined regimes where accurate support recovery is critical. These results suggest that sparsity priors closer to spike-and-slab structure can provide advantages when the underlying coefficient distribution is strongly sparse. The study highlights the importance of prior-data alignment in sparse inverse problems and provides empirical insights into the behavior of variational L0-type methods across different information bottlenecks.
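The contrast between L1 and L0-type priors discussed above can be seen on a single coefficient. The sketch below compares soft thresholding (the proximal operator of the L1 penalty) with hard thresholding, which is used here only as a simple stand-in for L0-type gating. It is not the Variational Garrote itself, which learns variational gating probabilities rather than applying a fixed cutoff. The coefficient values and threshold are illustrative.

```python
import math


def soft_threshold(x: float, lam: float) -> float:
    """L1 proximal step: shrink toward zero by lam, zeroing small values."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def hard_threshold(x: float, lam: float) -> float:
    """L0-style gate: keep the value unchanged if it survives, else zero it."""
    return x if abs(x) > lam else 0.0


coefficients = [3.0, -2.0, 0.5, -0.2]  # illustrative values
lam = 1.0

for c in coefficients:
    print(f"{c:+.1f} -> soft {soft_threshold(c, lam):+.1f}, hard {hard_threshold(c, lam):+.1f}")
```

Both rules zero out the small coefficients, but the L1 rule also shrinks the large ones, which is one source of the bias that spike-and-slab style priors avoid.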

144. 【2603.12445】Unmasking Biases and Reliability Concerns in Convolutional Neural Networks Analysis of Cancer Pathology Images

Link: https://arxiv.org/abs/2603.12445

Authors: Michael Okonoda, Eder Martinez, Abhilekha Dalal, Lior Shamir

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, shown promising effectiveness, Networks have shown

Comments: Electronics, published

Abstract:Convolutional Neural Networks have shown promising effectiveness in identifying different types of cancer from radiographs. However, the opaque nature of CNNs makes it difficult to fully understand the way they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen highly used cancer benchmark datasets were analyzed, using four common CNN architectures and different types of cancer, such as melanoma, carcinoma, colorectal cancer, and lung cancer. We compared the accuracy of each model with that of datasets made of cropped segments from the background of the original images that do not contain clinically relevant content. Because the rendered datasets contain no clinical information, the null hypothesis is that the CNNs should provide mere chance-based accuracy when classifying these datasets. The results show that the CNN models provided high accuracy when using the cropped segments, sometimes as high as 93%, even though they lacked biomedical information. These results show that some CNN architectures are more sensitive to bias than others. The analysis shows that the common practices of machine learning evaluation might lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify, and might mislead researchers as they use available benchmark datasets to test the efficacy of CNN methods.

145. 【2603.12400】Generation of maximal snake polyominoes using a deep neural network

Link: https://arxiv.org/abs/2603.12400

Authors: Benjamin Gauthier, Alain Goupil, Fadel Toure

Categories: Combinatorics (math.CO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: brute force algorithm, specific grid size, Maximal snake polyominoes, Maximal snake, force algorithm

Comments: 8-page extended abstract, plus 2 pages of references; 6 figures. Submitted to GASCom 2026

Abstract:Maximal snake polyominoes are difficult to study numerically in large rectangles, as computing them requires the complete enumeration of all snakes for a specific grid size, which corresponds to a brute force algorithm. This technique is thus challenging to use in larger rectangles, which hinders the study of maximal snakes. Furthermore, most enumerable snakes lie in small rectangles, making it difficult to study large-scale patterns. In this paper, we investigate the contribution of a deep neural network to the generation of maximal snake polyominoes from a data-driven training, where the maximality and adjacency constraints are not encoded explicitly, but learned. To this extent, we experiment with a denoising diffusion model, which we call Structured Pixel Space Diffusion (SPS Diffusion). We find that SPS Diffusion generalizes from small grids to larger ones, generating valid snakes up to 28x28 squares and producing maximal snake candidates on squares close to the current computational limit. The model is, however, prone to errors such as branching, cycles, or multiple components. Overall, the diffusion model is promising and shows that complex combinatorial objects can be understood by deep neural networks, which is useful in their investigation.
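The failure modes listed above (branching, cycles, multiple components) are easy to test for mechanically. Here is a minimal sketch of a validity checker, assuming the usual definition of a snake polyomino as a set of grid cells whose 4-neighbour adjacency graph is a simple path. The function name and return labels are hypothetical, and the check covers validity only, not maximality.

```python
from collections import deque


def neighbors(cell):
    """4-connected grid neighbours of a (row, col) cell."""
    r, c = cell
    return [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]


def classify_snake(cells):
    """Label a cell set as a valid snake or name the first defect found."""
    cells = set(cells)
    if not cells:
        return "empty"

    # Breadth-first search from any cell to test connectivity.
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for n in neighbors(current):
            if n in cells and n not in seen:
                seen.add(n)
                queue.append(n)
    if len(seen) != len(cells):
        return "multiple components"

    degree = {cell: sum(n in cells for n in neighbors(cell)) for cell in cells}
    if max(degree.values()) > 2:
        return "branching"

    # A connected graph with maximum degree 2 is a path exactly when it has
    # one fewer edge than vertices; otherwise it is a cycle.
    edges = sum(degree.values()) // 2
    if edges != len(cells) - 1:
        return "cycle"
    return "valid snake"


print(classify_snake([(0, 0), (0, 1), (0, 2), (1, 2)]))  # L-shaped path
print(classify_snake([(0, 0), (0, 1), (0, 2), (1, 1)]))  # T shape
print(classify_snake([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 2x2 block
print(classify_snake([(0, 0), (2, 2)]))                  # disconnected cells
```

A checker of this kind could be used to filter generated samples before any maximality test is attempted.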

146. 【2603.11850】Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning

Link: https://arxiv.org/abs/2603.11850

Authors: Johan Andreas Balle Rubak, Sara Haghighat, Sanyam Jain, Mostafa Aldesoki, Akhilanand Chaurasia, Sarah Sadat Ehsani, Faezeh Dehghan Ghanatkaman, Ahmad Badruddin Ghazali, Julien Issa, Basel Khalil, Rishi Ramani, Ruben Pauwels

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: mandibular canal increases, alveolar nerve injury, inferior alveolar nerve, mandibular canal, nerve injury

Comments: none

Abstract:Impaction of the mandibular third molar in proximity to the mandibular canal increases the risk of inferior alveolar nerve injury. Panoramic radiography is routinely used to assess this relationship. Automated classification of molar-canal overlap could support clinical triage and reduce unnecessary CBCT referrals, while federated learning (FL) enables multi-center collaboration without sharing patient data. We compared Local Learning (LL), FL, and Centralized Learning (CL) for binary overlap/no-overlap classification on cropped panoramic radiographs partitioned across eight independent labelers. A pretrained ResNet-34 was trained under each paradigm and evaluated using per-client metrics with locally optimized thresholds and pooled test performance with a global threshold. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-based metrics, alongside training dynamics, Grad-CAM visualizations, and server-side aggregate monitoring signals. On the test set, CL achieved the highest performance (AUC 0.831; accuracy = 0.782), FL showed intermediate performance (AUC 0.757; accuracy = 0.703), and LL generalized poorly across clients (AUC range = 0.619-0.734; mean = 0.672). Training curves suggested overfitting, particularly in LL models, and Grad-CAM indicated more anatomically focused attention in CL and FL. Overall, centralized training provided the strongest performance, while FL offers a privacy-preserving alternative that outperforms LL.
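To illustrate how federated learning can combine models without pooling patient data, the sketch below shows the server-side aggregation step of federated averaging (FedAvg): each client sends its parameters, and the server averages them weighted by local dataset size. The abstract does not state which aggregation rule the study used, so FedAvg is an assumption here, and the parameter vectors and client sizes are toy values.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")

    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must send the same number of parameters")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Two toy clients with flattened two-parameter models (illustrative values).
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]  # the second client holds three times as many images

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```

Only parameter vectors and dataset sizes reach the server in this scheme, which is what allows multi-center training without sharing the radiographs themselves.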