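This blog post presents the daily list of the latest papers fetched from the arXiv website, grouped by category: natural language processing, information retrieval, computer vision, and others.
Statistics
1334 papers were updated today, including:
- Natural language processing: 167 papers
- Information retrieval: 35 papers
- Computer vision: 357 papers
Natural Language Processing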
1. 【2603.02208】Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training
Link: https://arxiv.org/abs/2603.02208
Authors: Valentin Lacombe, Valentin Quesnel, Damien Sileo
Categories: Computation and Language (cs.CL)
Keywords: standard pre-training corpora, reasoning, pre-training corpora provide, verifiable symbolic, Reasoning Core
Comments: Keywords: LLMs, NLP, Dataset, Corpus, Procedural Pre-training, Reasoning, Logic, Formal Semantics. [this https URL](https://github.com/sileod/reasoning_core)
Abstract:Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
2. 【2603.02203】Tool Verification for Test-Time Reinforcement Learning
Link: https://arxiv.org/abs/2603.02203
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: large reasoning models, Test-time reinforcement learning, self-evolving large reasoning, unlabeled test inputs, enabling online adaptation
Comments: 12 pages, 11 figures
Abstract:Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
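The verification-aware voting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the `(answer, verified)` rollout format, and the weight of 2.0 for verified rollouts are assumptions made for illustration only. It shows how upweighting tool-verified rollouts can overturn a frequent but unverified consensus.

```python
from collections import defaultdict


def verification_aware_vote(rollouts, verified_weight=2.0):
    """Pick a pseudo-label by weighted majority vote.

    rollouts: list of (answer, verified) pairs, where `verified` says whether
    an external tool (e.g. code execution) confirmed the answer.
    verified_weight: hypothetical weight for verified rollouts; unverified
    rollouts count as 1.0.
    """
    scores = defaultdict(float)
    for answer, verified in rollouts:
        scores[answer] += verified_weight if verified else 1.0
    # Ties go to the answer that appeared first (dicts keep insertion order).
    return max(scores, key=scores.get)


# Toy example: three unverified rollouts agree on "42", two verified say "17".
rollouts = [("42", False), ("42", False), ("42", False), ("17", True), ("17", True)]
plain_label = verification_aware_vote(rollouts, verified_weight=1.0)
weighted_label = verification_aware_vote(rollouts, verified_weight=2.0)
```

With equal weights this reduces to plain majority voting and the unverified consensus wins; with the assumed weight of 2.0, the verified answer becomes the pseudo-label.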
3. 【2603.02176】Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
Link: https://arxiv.org/abs/2603.02176
Authors: Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu
Categories: Computation and Language (cs.CL)
Keywords: Claude agent skills, proliferation of Claude, Claude agent, rapid proliferation, raised the central
Comments:
Abstract:The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: this https URL.
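The abstract says pairwise judgments are aggregated with a Bradley-Terry model but does not say how the model is fitted. The Python sketch below uses the standard minorization-maximization updates as one possible fitting procedure; the function name, the iteration count, and the toy win matrix are assumptions, not details from the paper.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of comparisons in which system i beat system j.
    Returns strengths normalized to sum to 1 (higher means stronger).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Hypothetical judgments among three systems, 10 comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = bradley_terry(wins)
```

The resulting strengths give a single quality score per system, so systems can be ranked even when they were never compared against a common reference.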
4. 【2603.02153】Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
Link: https://arxiv.org/abs/2603.02153
Authors: Luigi Medrano, Arush Verma, Mukul Chhabra
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, commonly adopt retrieval, higher recall leads, reciprocal rank fusion, systems commonly adopt
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing from $0.51$ to $0.48$ in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.
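The reciprocal rank fusion (RRF) step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name, the toy document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions for illustration. Each document scores the sum of `1 / (k + rank)` over the ranked lists in which it appears, and the fused list is then truncated to a fixed depth.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document IDs with RRF.

    rankings: list of ranked lists (best first), e.g. one per rewritten query.
    k: smoothing constant; 60 is a common default, assumed here.
    top_n: optional truncation depth applied after fusion.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


# Toy example: three query variants, each returning a ranked list.
rankings = [
    ["a", "b", "c"],
    ["b", "d", "a"],
    ["b", "a", "e"],
]
fused = reciprocal_rank_fusion(rankings)
top2 = reciprocal_rank_fusion(rankings, top_n=2)
```

Fusion widens the candidate set to five documents here, but only the first `top_n` survive truncation, which is the kind of fixed budget under which the abstract reports that recall gains are largely neutralized.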
5. 【2603.02150】Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)
Link: https://arxiv.org/abs/2603.02150
Authors: Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera, Javier Irigoyen, Ruben Tolosana, Oscar Delgado, Francisco Jurado, Alvaro Ortigosa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Keywords: law enforcement agencies, enforcement agencies involved, law enforcement, enforcement agencies, Crime-related Named-Entity Recognition
Comments: Sent for review at the main conference of the International Conference of Document Analysis and Recognition (ICDAR) 2026
Abstract:The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task in extracting information about the crime, the criminal, or law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case-study of Crime-related zero- and Few-Shot NER, and a general Crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We address the quality of the case-study and the annotated data with experiments on Zero and Few-Shot settings with State-of-the-Art NER models as well as generalist and commonly used Large Language Models.
6. 【2603.02146】LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Link: https://arxiv.org/abs/2603.02146
Authors: Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, capabilities of Large, Reinforcement Learning, Language Models
Comments: ICLR 2026
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at this https URL.
7. 【2603.02128】LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Link: https://arxiv.org/abs/2603.02128
Authors: Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Large language models, Large language, strategic decision environments, simulations remains under-researched, geopolitical simulations remains
Comments:
Abstract:Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside results from human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions' severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
8. 【2603.02112】Recursive Models for Long-Horizon Reasoning
Link: https://arxiv.org/abs/2603.02112
Authors: Chenxiao Yang, Nathan Srebro, Zhiyuan Li
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Modern language models, inherent constraint, constraint that poses, poses a fundamental, language models reason
Comments:
Abstract:Modern language models reason within bounded context, an inherent constraint that poses a fundamental barrier to long-horizon reasoning. We identify recursion as a core principle for overcoming this barrier, and propose recursive models as a minimal realization, where the model can recursively invoke itself to solve subtasks in isolated contexts. We prove that any computable problem admits a recursive decomposition in which each subtask requires only exponentially smaller active context than standard autoregressive models; this strictly surpasses any context management approach confined to a single sequence, such as summarization. We further generalize our framework to modern agentic systems with arbitrary context processing and control flows, and prove that recursive models can achieve optimal power within this broader class. Experimentally, we train a 3B model to reason recursively and evaluate on Boolean satisfiability, a task requiring long-horizon combinatorial search, where it significantly outperforms frontier LLMs.
9. 【2603.02099】Recursive Think-Answer Process for LLMs and VLMs
Link: https://arxiv.org/abs/2603.02099
Authors: Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Categories: Computation and Language (cs.CL)
Keywords: made notable progress, leveraging interpretable internal, interpretable internal reasoning, Recursive Think-Answer Process, made notable
Comments: CVPR 2026 Findings, Project page: [this https URL](https://litcoderr.github.io/rtap_page/)
Abstract:Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards-Recursively Confidence Increase Reward and Final Answer Confidence Reward-we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP pave the way evolving into efficient and elaborated methods to refine the reasoning processes of future AI.
10. 【2603.02098】OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Link: https://arxiv.org/abs/2603.02098
Authors: Chuong Huynh, Manh Luong, Abhinav Shrivastava
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: retrieve desired targets, desired targets, retrieval, retrieve desired, Multimodal retrieval
Comments: CVPR 2026. Project link: [this https URL](https://github.com/hmchuong/omniret)
Abstract:Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and a MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks-composed audio retrieval and audio-visual retrieval to more comprehensively evaluate a model's omni-modal embedding capacity.
11. 【2603.02097】ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels
Link: https://arxiv.org/abs/2603.02097
Authors: Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, health management, showing promise, increasingly applied
Comments: 8 pages, 6 figures
Abstract:Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
12. 【2603.02091】Learning from Synthetic Data Improves Multi-hop Reasoning
Link: https://arxiv.org/abs/2603.02091
Authors: Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Reinforcement Learning, large language models, language models, large language, multi-hop reasoning tasks
Comments: Accepted to ICLR 2026
Abstract:Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
13. 【2603.02084】Modeling Grammatical Hypothesis Testing in Young Learners: A Sequence-Based Learning Analytics Study of Morphosyntactic Reasoning in an Interactive Game
Link: https://arxiv.org/abs/2603.02084
Authors: Thierry Geoffre, Trystan Geoffre
Categories: Computation and Language (cs.CL)
Keywords: interactive game targeting, game targeting morphosyntactic, targeting morphosyntactic agreement, agreement in French, leveraging fine-grained action
Comments:
Abstract:This study investigates grammatical reasoning in primary school learners through a sequence-based learning analytics approach, leveraging fine-grained action sequences from an interactive game targeting morphosyntactic agreement in French. Unlike traditional assessments that rely on final answers, we treat each slider movement as a hypothesis-testing action, capturing real-time cognitive strategies during sentence construction. Analyzing 597 gameplay sessions (9,783 actions) from 100 students aged 8-11 in authentic classroom settings, we introduce Hamming distance to quantify proximity to valid grammatical solutions and examine convergence patterns across exercises with varying levels of difficulty. Results reveal that determiners and verbs are key sites of difficulty, with action sequences deviating from left-to-right usual treatment. This suggests learners often fix the verb first and adjust preceding elements. Exercises with fewer solutions exhibit slower and more erratic convergence, while changes in the closest valid solution indicate dynamic hypothesis revision. Our findings demonstrate how sequence-based analytics can uncover hidden dimensions of linguistic reasoning, offering a foundation for real-time scaffolding and teacher-facing tools in linguistically diverse classrooms.
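The Hamming-distance measure in the abstract can be illustrated with a minimal Python sketch. The sentence encoding (one token per slider position), the two valid solutions, and the sequence of learner states are hypothetical examples, not data or code from the study. The sketch computes, after each action, the distance to the closest valid solution.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_to_closest_valid(state, valid_solutions):
    """Hamming distance from the current state to the nearest valid solution."""
    return min(hamming(state, solution) for solution in valid_solutions)


# Hypothetical exercise with two grammatical solutions (singular and plural).
valid_solutions = [
    ("le", "chat", "dort"),
    ("les", "chats", "dorment"),
]

# Hypothetical learner states, one per slider movement.
states = [
    ("les", "chat", "dort"),      # closest to the singular solution
    ("les", "chat", "dorment"),   # verb changed: now closest to the plural one
    ("les", "chats", "dorment"),  # valid plural sentence
]
trajectory = [distance_to_closest_valid(s, valid_solutions) for s in states]
```

In this toy trajectory the distance stays at 1 after the second action while the closest valid solution switches from singular to plural, the kind of change the abstract reads as hypothesis revision.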
14. 【2603.02082】What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Link: https://arxiv.org/abs/2603.02082
Authors: Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank
Categories: Computation and Language (cs.CL)
Keywords: innate grammatical knowledge, child-directed speech suffices, grammatical knowledge, speech suffices, depend on innate
Comments:
Abstract:Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
15. 【2603.02081】GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Link: https://arxiv.org/abs/2603.02081
Authors: Jiale Lao, Immanuel Trummer
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Keywords: Traditional query processing, Traditional query, query processing relies, carefully optimized, optimized and engineered
Comments:
Abstract:Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.
16. 【2603.02070】Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Link: https://arxiv.org/abs/2603.02070
Authors: Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Keywords: sequential decision problem, automating plan generation, real-world sequential decision, human planner, replace the human
Comments: Preprint
Abstract:When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
17. 【2603.02041】EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Link: https://arxiv.org/abs/2603.02041
Authors: Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, trained on English-centric, resulting in uneven, smaller languages
Comments:
Abstract:Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
18. 【2603.02026】Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Link: https://arxiv.org/abs/2603.02026
Authors: Simon Ging (1 and 2), Philipp Arnold (3), Sebastian Walter (4), Hani Alnahas (1), Hannah Bast (4), Elmar Kotter (3), Jiancheng Yang (5 and 6), Behzad Bozorgtabar (2), Thomas Brox (1) ((1) Computer Vision Group, University of Freiburg, Germany, (2) Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark, (3) Department of Radiology, Medical Center -- University of Freiburg, Germany, (4) Chair of Algorithms and Data Structures, University of Freiburg, Germany, (5) ELLIS Institute Finland, (6) School of Electrical Engineering, Aalto University, Finland)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: limited public data, coarse global supervision, models align volumes, vision-language models align, align volumes
Comments:
Abstract:Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., ``series X, image Y''), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization -- predicting the axial depth referred to by a text snippet -- reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
19. 【2603.02024】MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Link: https://arxiv.org/abs/2603.02024
Authors: Jiachun Li,Shaoping Huang,Zhuoran Jin,Chenlong Zhang,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent progress, multimodal large language, large language models, reasoning, large language
Comments: Accepted by ICLR 2026, 78 pages, 60 figures
Abstract:Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
20. 【2603.02023】PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
Link: https://arxiv.org/abs/2603.02023
Authors: He Li,Feichen Song,Boyi Zeng,Shixiang Song,Zhiqin John Xu,Ziwei He,Zhouhan Lin
Categories: Computation and Language (cs.CL)
Keywords: improve generation quality, natural follow-up question, Test-time scaling, generation quality, motivating a natural
Comments:
Abstract:Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
21. 【2603.01990】According to Me: Long-Term Personalized Referential Memory QA
Link: https://arxiv.org/abs/2603.01990
Authors: Jingbiao Mei,Jinghong Chen,Guangyu Yang,Xinyu Hou,Margaret Li,Bill Byrne
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: naturally spans multiple, spans multiple modalities, long-term user memory, existing Long-term Memory, long-term user
Comments: Preprint
Abstract:Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over Descriptive Memory commonly adopted in prior works. Code available at: this https URL
22. 【2603.01973】CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Link: https://arxiv.org/abs/2603.01973
Authors: Yixin Nie,Lin Guan,Zhongyao Ma,Anchit Gupta,Yipin Zhou,Xiao Li,Zhengping Zhou,Raymond Zeng,Gelin Zhou,Shigan Chu,Ajay Thampi,Wancen Mu,Nathan Shuster,Ketong Wang,Lin Chen,Jason Brewer,Derek Hao Hu,Alexander McCauley,Jason Weston,Sem Park,Na Zhang,Kevin Tang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Keywords: improving large language, large language models, report presents CharacterFlywheel, iterative flywheel process, report presents
Comments:
Abstract:This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
23. 【2603.01966】AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Link: https://arxiv.org/abs/2603.01966
Authors: Cheng Jiayang,Dongyu Ru,Lin Qiu,Yiyang Li,Xuezhi Cao,Yangqiu Song,Xunliang Cai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLM-based assistants necessitate, current approaches face, approaches face challenges, Long-horizon interactions, face challenges
Comments: Accepted to ICLR 2026
Abstract:Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.
24. 【2603.01950】Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Link: https://arxiv.org/abs/2603.01950
Authors: Christopher Driggers-Ellis,Nachiketh Tibrewal,Rohit Bogulla,Harsh Khanna,Sangpil Youm,Christan Grant,Bonnie Dorr
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: manga would introduce, medium of storytelling, visually impaired users, visually impaired, impaired users
Comments: 8 pages, 2 figures, 3 tables. Includes link to code
Abstract:A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
25. 【2603.01945】When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Link: https://arxiv.org/abs/2603.01945
Authors: Thibault Prouteau,Francis Lareau,Nicolas Dugué,Jean-Charles Lamirel,Christophe Malaterre
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: quality remains challenging, uncover latent thematic, latent thematic structures, models uncover latent, remains challenging
Comments:
Abstract:Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics like topic coherence and diversity, which may not fully align with human judgment. Human evaluation tasks, such as word intrusion, provide valuable insights but are costly and primarily validated on general-domain corpora. This paper introduces Topic Word Mixing (TWM), a novel human evaluation task assessing inter-topic distinctness by testing whether annotators can distinguish between word sets from single or mixed topics. TWM complements word intrusion's focus on intra-topic coherence and provides a human-grounded counterpart to diversity metrics. We evaluate six topic models - both statistical and embedding-based (LDA, NMF, Top2Vec, BERTopic, CFMF, CFMF-emb) - comparing automated metrics with human evaluation methods based on nearly 4,000 annotations from a domain-specific corpus of philosophy of science publications. Our findings reveal that word intrusion and coherence metrics do not always align, particularly in specialized domains, and that TWM captures human-perceived distinctness while appearing to align with diversity metrics. We release the annotated dataset and task generation code. This work highlights the need for evaluation frameworks bridging automated and human assessments, particularly for domain-specific corpora.
26. 【2603.01930】From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation
Link: https://arxiv.org/abs/2603.01930
Authors: Junbo Huang,Max Weinig,Ulrich Fritsche,Ricardo Usbeck
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: shaping public understanding, Natural Language Processing, discourse play, play a critical, critical role
Comments: LREC 2026 Accepted Paper
Abstract:Narratives in news discourse play a critical role in shaping public understanding of economic events, such as inflation. Annotating and evaluating these narratives in a structured manner remains a key challenge for Natural Language Processing (NLP). In this work, we introduce a narrative graph annotation framework that integrates principles from qualitative content analysis (QCA) to prioritize annotation quality by reducing annotation errors. We present a dataset of inflation narratives annotated as directed acyclic graphs (DAGs), where nodes represent events and edges encode causal relations. To evaluate annotation quality, we employed a $6\times3$ factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorff's $\alpha$), capturing the presence of human label variation (HLV) in narrative interpretations. Our analysis shows that (1) lenient metrics (overlap-based distance) overestimate reliability, and (2) locally-constrained representations (e.g., one-hop neighbors) reduce annotation variability. Our annotation and implementation of graph-based Krippendorff's $\alpha$ are open-sourced. The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
27. 【2603.01914】AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Link: https://arxiv.org/abs/2603.01914
Authors: Shixiang Song,He Li,Zitong Wang,Boyi Zeng,Feichen Song,Yixuan Wang,Zhiqin John Xu,Ziwei He,Zhouhan Lin
Categories: Computation and Language (cs.CL)
Keywords: iterative Transformers enables, Transformers enables large, lacking token-wise adaptivity, iterative Transformers, Test-time scaling
Comments:
Abstract:Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
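To illustrate what a "monotonic halting mask" can mean, here is a minimal sketch. It assumes the gates have already been reduced to hard per-iteration continue/halt decisions for a single token; the function name `monotonic_halting_mask` and that input format are hypothetical, and the paper's MLP gates and training-time behavior are not reproduced here.

```python
def monotonic_halting_mask(continue_decisions):
    """Convert per-iteration continue/halt decisions into a monotonic mask.

    continue_decisions[t] is truthy if the (hypothetical) gate at iteration t
    votes to keep computing. The returned mask[t] is True only if every gate
    up to and including t voted to continue, so a halted token never resumes.
    """
    mask = []
    active = True
    for keep_going in continue_decisions:
        # Once any gate has voted to halt, the token stays inactive.
        active = active and bool(keep_going)
        mask.append(active)
    return mask


# The gate at iteration 2 halts the token; the later "continue" vote is ignored.
example_mask = monotonic_halting_mask([True, True, False, True])
```

The point of the monotonic constraint in this sketch is that a token's computation depth is a single stopping index, which is what makes reusing cached states for halted tokens well defined.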
28. 【2603.01912】Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Link: https://arxiv.org/abs/2603.01912
Authors: Yinghao Tang,Yupeng Xie,Yingchaojie Feng,Tingfeng Lan,Wei Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: web development skills, ideas through exploration, remains costly, requiring both domain, development skills
Comments:
Abstract:Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at this https URL.
29. 【2603.01910】FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Link: https://arxiv.org/abs/2603.01910
Authors: Liliia Bogdanova,Shiran Sun,Lifeng Han,Natalia Amat Lefort,Flor Miriam Plaza-del-Arco
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Short Answer Questions, system paper describes, Everyday Knowledge, Diverse Languages, paper describes
Comments:
Abstract:This system paper describes our participation in the SemEval-2025 Task-7 "Everyday Knowledge Across Diverse Languages and Cultures". We took part in two subtasks, i.e., Track 1: Short Answer Questions (SAQ), and Track 2: Multiple-Choice Questions (MCQ). The method we used is retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge bases (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and code are shared via this https URL
30. 【2603.01907】Efficient RLVR Training via Weighted Mutual Information Data Selection
Link: https://arxiv.org/abs/2603.01907
Authors: Xinyu Zhou,Boyu Zhu,Haotian Zhang,Huiming Wang,Zhijiang Guo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, efficiency critically depends, plays a central, language models, central role
Comments: 15 Pages
Abstract:Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
31. 【2603.01875】KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
Link: https://arxiv.org/abs/2603.01875
Authors: Songming Zhang,Xue Zhang,Tong Zhang,Bojie Hu,Yufeng Chen,Jinan Xu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: compress large language, large language models, Knowledge distillation, essential technique, technique to compress
Comments: 8 pages, 4 figures, 3 tables, code is available at: [this https URL](https://github.com/songmzhang/KDFlow)
Abstract:Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: this https URL
32. 【2603.01869】Sovereign AI-based Public Services are Viable and Affordable
Link: https://arxiv.org/abs/2603.01869
Authors: António Branco,Luís Gomes,Rodrigo Santos,Eduardo Santos,João Silva,Nuno Marques,Madalena Rodrigues
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: growing structural concentration, infrastructure and expertise, rapid expansion, intensified debates, long-term implications
Comments: Accepted at LREC 2026
Abstract:The rapid expansion of AI-based remote services has intensified debates about the long-term implications of growing structural concentration in infrastructure and expertise. As AI capabilities become increasingly intertwined with geopolitical interests, the availability and reliability of foundational AI services can no longer be taken for granted. This issue is particularly pressing for AI-enabled public services for citizens, as governments and public agencies are progressively adopting 24/7 AI-driven support systems typically operated through commercial offerings from a small oligopoly of global technology providers. This paper challenges the prevailing assumption that general-purpose architectures, offered by these providers, are the optimal choice for all application contexts. Through practical experimentation, we demonstrate that viable and cost-effective alternatives exist, alternatives that align with principles of digital and cultural sovereignty. Our findings provide an empirical illustration that sovereign AI-based public services are both technically feasible and economically sustainable, capable of operating effectively on premises with modest computational and financial resources while maintaining cultural and digital autonomy. The technical insights and deployment lessons reported here are intended to inform the adoption of similar sovereign AI public services by national agencies and governments worldwide.
33. 【2603.01865】CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
Link: https://arxiv.org/abs/2603.01865
Authors: Ziyi Zhu,Olivier Tieleman,Alexey Bukhtiyarov,Jinghong Chen
Categories: Computation and Language (cs.CL)
Keywords: open-ended model assessment, exhibit systematic biases, judges exhibit systematic, standard practice, practice for open-ended
Comments:
Abstract:LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
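A round-robin judge assignment of the kind the abstract describes can be sketched as follows. This is only an illustration under stated assumptions: the function name `cyclic_judge_assignment`, the judge names, and the choice to cycle over a flat index of evaluation items are all hypothetical, and the paper may order scenarios and generations differently.

```python
from collections import Counter


def cyclic_judge_assignment(num_items, judges):
    """Assign exactly one judge to each evaluation item by cycling the list.

    Item i is scored by judges[i % len(judges)], so every judge is used once
    per cycle and the total number of judge calls equals num_items, the same
    cost as scoring everything with a single judge.
    """
    if not judges:
        raise ValueError("at least one judge is required")
    return [judges[i % len(judges)] for i in range(num_items)]


judges = ["judge_a", "judge_b", "judge_c"]  # hypothetical judge names
assignment = cyclic_judge_assignment(6, judges)
usage = Counter(assignment)  # how many items each judge scored
```

With six items and three judges, each judge scores two items, so any per-judge offset enters the average with equal weight instead of dominating it as in a single-judge setup.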
34. 【2603.01853】Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering
Link: https://arxiv.org/abs/2603.01853
Authors: Xufei Lv,Jiahui Yang,Yifu Gao,Linbo Qiao,Houde Liu
Categories: Computation and Language (cs.CL)
Keywords: demands multi-hop reasoning, Temporal Knowledge Graph, Knowledge Graph Question, demands multi-hop, temporal question answering
Comments:
Abstract:Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on this https URL
35. 【2603.01824】OpenAutoNLU: Open Source AutoML Library for NLU
Link: https://arxiv.org/abs/2603.01824
Authors: Grigory Arshinov,Aleksandr Boriskin,Sergey Senichev,Ayaz Zaripov,Daria Galimzianova,Daniil Karpov,Leonid Sanochkin
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: named entity recognition, open-source automated machine, automated machine learning, natural language understanding, machine learning library
Comments:
Abstract:OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible at this https URL.
36. 【2603.01795】PleaSQLarify: Visual Pragmatic Repair for Natural Language Database Querying
Link: https://arxiv.org/abs/2603.01795
Authors: Robin Shing Moon Chan,Rita Sevastjanova,Mennatallah El-Assady
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: broaden data access, database interfaces broaden, interfaces broaden data, data access, broaden data
Comments: Accepted at CHI'26, main track
Abstract:Natural language database interfaces broaden data access, yet they remain brittle under input ambiguity. Standard approaches often collapse uncertainty into a single query, offering little support for mismatches between user intent and system interpretation. We reframe this challenge through pragmatic inference: while users economize expressions, systems operate on priors over the action space that may not align with the users'. In this view, pragmatic repair -- incremental clarification through minimal interaction -- is a natural strategy for resolving underspecification. We present PleaSQLarify, which operationalizes pragmatic repair by structuring interaction around interpretable decision variables that enable efficient clarification. A visual interface complements this by surfacing the action space for exploration, requesting user disambiguation, and making belief updates traceable across turns. In a study with twelve participants, PleaSQLarify helped users recognize alternative interpretations and efficiently resolve ambiguity. Our findings highlight pragmatic repair as a design principle that fosters effective user control in natural language interfaces.
37. 【2603.01792】ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Link: https://arxiv.org/abs/2603.01792
Authors: Xunlei Chen,Jinyu Guo,Yuang Li,Zhaokun Wang,Yi Gong,Jie Zou,Jiwei Wei,Wenhong Tian
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, encompass extensive knowledge, diverse domains, advanced to encompass
Comments: Accepted at The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract:Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains. This serves as a new research direction for achieving unlearning via token-level isolation in the asymmetric framework. ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.
38. 【2603.01791】Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Link: https://arxiv.org/abs/2603.01791
Authors: Fred Zimmerman
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: compression progress theory, English-language publishing, apply Schmidhuber compression, Schmidhuber compression progress, centuries of English-language
Comments: 12 pages, 4 figures, 5 tables
Abstract:I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness (the ratio of cumulative path length to net displacement in embedding space) nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at this https URL.
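The two trajectory measures named in the abstract above can be sketched in a few lines. Circuitousness follows the definition given there (cumulative path length divided by net displacement). The novelty function is an assumption: it takes novelty to be the cosine distance between a paragraph embedding and the mean of all preceding embeddings, which is one plausible reading of "running-centroid novelty" and not necessarily the paper's exact formula.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)


def running_centroid_novelty(embeddings):
    """Novelty of paragraph i (i >= 1) relative to the centroid of paragraphs 0..i-1.

    Assumed definition; the first paragraph has no history and is skipped.
    """
    scores = []
    total = list(embeddings[0])
    for i in range(1, len(embeddings)):
        centroid = [t / i for t in total]
        scores.append(cosine_distance(embeddings[i], centroid))
        total = [t + x for t, x in zip(total, embeddings[i])]
    return scores


def circuitousness(embeddings):
    """Cumulative path length divided by net start-to-end displacement."""
    path = sum(math.dist(a, b) for a, b in zip(embeddings, embeddings[1:]))
    net = math.dist(embeddings[0], embeddings[-1])
    return path / net if net > 0 else math.inf


# A straight-line trajectory has circuitousness 1; detours push it above 1.
straight = circuitousness([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
detour = circuitousness([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```

Real inputs would be sentence-transformer paragraph embeddings; the 2-D points are only for illustration.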
39. 【2603.01788】nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Link: https://arxiv.org/abs/2603.01788
Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Categories: Computation and Language (cs.CL)
Keywords: Self-Consistent Structured Generation, Dimensional Aspect-Based Sentiment, Aspect-Based Sentiment Analysis, present Self-Consistent Structured, Structured Generation
Comments:
Abstract:We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM's PagedAttention mechanism for efficient key-value cache reuse. Evaluation across 6 languages and 8 language-domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
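The majority-consensus step described in the abstract above can be illustrated with a small voting routine. This is a minimal sketch under stated assumptions: each run is a list of hashable tuples, tuples are matched exactly, and "majority" means appearing in more than half of the runs. The abstract does not say how continuous sentiment dimensions are matched or aggregated, so that part is left out.

```python
from collections import Counter


def majority_consensus(runs, threshold=0.5):
    """Keep tuples predicted in more than `threshold` of the runs.

    `runs` is a list of per-run predictions, each a list of hashable tuples.
    A tuple repeated within a single run is counted once for that run.
    """
    counts = Counter()
    for run in runs:
        counts.update(set(run))
    needed = len(runs) * threshold
    return sorted(t for t, c in counts.items() if c > needed)


# Three hypothetical runs over the same review sentence.
runs = [
    [("food", "great")],
    [("food", "great"), ("service", "slow")],
    [("food", "great")],
]
kept = majority_consensus(runs)
```

With 15 executions, as in the abstract, a tuple would need to appear in at least 8 runs to survive under this rule.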
40. 【2603.01778】LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
Link: https://arxiv.org/abs/2603.01778
Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Categories: Computation and Language (cs.CL)
Keywords: Aspect-Based Sentiment Analysis, Large Language Model, tasks requires manually, Sentiment Analysis, Training models
Comments: Accepted for publication at LREC 2026. Final version will appear in the ACL Anthology
Abstract:Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.
41. 【2603.01776】FreeAct: Freeing Activations for LLM Quantization
Link: https://arxiv.org/abs/2603.01776
Authors: Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu, Fei Shen, Xiu Su, See-Kiong Ng, Tat-Seng Chua
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Language Models, Large Language, overhead of Large, pivotal for mitigating
Comments: 26 pages, 18 figures, 2 tables
Abstract:Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To address this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs, together with in-depth analyses, demonstrate that FreeAct significantly outperforms baselines, with up to 5.3% performance improvement. Our code will be publicly released.
42. 【2603.01775】Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation
Link: https://arxiv.org/abs/2603.01775
Authors: Harry Stuart, Masahiro Kaneko, Timothy Baldwin
Categories: Computation and Language (cs.CL)
Keywords: Effective hiring, technical manager, deploy at scale, challenging to find
Comments:
Abstract:Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g. interviews conducted by a technical manager) is expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant's rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants' artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at this https URL. Our demo is available at this https URL.
43. 【2603.01773】AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions
Link: https://arxiv.org/abs/2603.01773
Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Categories: Computation and Language (cs.CL)
Keywords: Aspect-Based Sentiment Analysis, Sentiment Analysis, web-based annotation tool, Large Language Model, support the full
Comments: Accepted for publication at LREC 2026. Final version will appear in the ACL Anthology
Abstract:We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
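The retrieval step in the abstract above (fetch the ten most similar already-annotated examples and add them to the prompt as few-shot demonstrations) might look roughly like the following. Everything here is illustrative: the bag-of-words cosine similarity stands in for whatever retriever the tool actually uses, and the function names, record fields, and prompt layout are hypothetical.

```python
import heapq
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_examples(query, annotated, k=10):
    """Return the k annotated examples most similar to the query text."""
    q = bow(query)
    return heapq.nlargest(k, annotated, key=lambda ex: cosine(q, bow(ex["text"])))


def build_prompt(query, annotated, k=10):
    """Few-shot prompt: retrieved examples first, then the text to annotate."""
    shots = top_k_examples(query, annotated, k)
    blocks = [f"Text: {ex['text']}\nLabels: {ex['labels']}" for ex in shots]
    blocks.append(f"Text: {query}\nLabels:")
    return "\n\n".join(blocks)


annotated = [
    {"text": "the pizza was cold", "labels": "(pizza, negative)"},
    {"text": "service was slow", "labels": "(service, negative)"},
    {"text": "nice view", "labels": "(view, positive)"},
]
best = top_k_examples("the pizza was great", annotated, k=1)
```

Because the pool of annotated examples grows as annotators work, the same query can retrieve closer matches later in a project, which is the mechanism the abstract credits for suggestions improving over time.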
44. 【2603.01732】Bootstrapping Embeddings for Low Resource Languages
Link: https://arxiv.org/abs/2603.01732
Authors: Merve Basoz, Andrew Horne, Mattia Opper
Categories: Computation and Language (cs.CL)
Keywords: modern NLP, crucial to modern, NLP, Embedding models, models
Comments: (v1 - LowResLM Camera Ready)
Abstract:Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
45. 【2603.01714】TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
Link: https://arxiv.org/abs/2603.01714
Authors: Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, Fei Wu, Kun Kuang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Training tool-use agents, Reinforcement Learning, tool-use agents typically, agents typically relies, Supervised Fine-Tuning
Comments: Under Review
Abstract:Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
46. 【2603.01710】Legal RAG Bench: an end-to-end benchmark for legal RAG
Link: https://arxiv.org/abs/2603.01710
Authors: Abdur-Rahman Butler, Umar Butler
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Legal RAG Bench, Legal RAG, legal RAG systems, RAG Bench, introduce Legal RAG
Comments: 13 pages, 3 figures, 4 tables
Abstract:We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
47. 【2603.01691】Building a Strong Instruction Language Model for a Less-Resourced Language
Link: https://arxiv.org/abs/2603.01691
Authors: Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, natural language processing, Slovene LLM arena, Large language, Slovene
Comments: Currently under review at Natural Language Processing Special Issue on Language Models for Low-Resource Languages
Abstract:Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60 %.
48. 【2603.01690】QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions
Link: https://arxiv.org/abs/2603.01690
Authors: Yixuan Tang, Zhenghong Lin, Yandong Sun, Anthony K.H. Tung
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: black-box nature limits, clinical decision-making, nature limits, limits their utility, utility in clinical
Comments:
Abstract:While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
49. 【2603.01683】Surgical Post-Training: Cutting Errors, Keeping Knowledge
Link: https://arxiv.org/abs/2603.01683
Authors: Wenye Lin, Kai Han
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Direct Preference Optimization, capabilities of Large, Language Models
Comments: 15 pages
Abstract:Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: this https URL
50. 【2603.01666】Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Link: https://arxiv.org/abs/2603.01666
Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Visual Document Retrieval, visually-rich documents requires, Harnessing the full, challenge in Visual, documents requires retrieval
Comments: Under review
Abstract:Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
51. 【2603.01651】LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence
Link: https://arxiv.org/abs/2603.01651
Authors: Anka Chandrahas Tummepalli, Preethu Rose Anish
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: predicting judicial outcomes, judicial outcomes demands, outcomes demands nuanced, demands nuanced analysis, Understanding and predicting
Comments: Published in AILaw @ AAAI 2026 Conference
Abstract:Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.
52. 【2603.01639】Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
Link: https://arxiv.org/abs/2603.01639
Authors: Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li
Categories: Computation and Language (cs.CL)
Keywords: large language model, larger target model, Speculative decoding accelerates, accelerates large language, small draft model
Comments: 22 pages, 7 figures
Abstract:Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
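To make the draft-versus-verify trade-off in the abstract above concrete, the sketch below uses the standard simplified analysis of speculative decoding, not LTD's learned policies. It assumes each drafted token is accepted independently with probability `alpha`, that drafting one token costs `t_draft`, and that one verification pass costs `t_verify` regardless of draft length. All names and cost parameters are assumptions for illustration.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens produced per cycle when drafting `gamma` tokens.

    Under i.i.d. acceptance with probability `alpha`, this is the geometric
    sum 1 + alpha + ... + alpha**gamma (the verifier always yields one token).
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def cycle_throughput(alpha, gamma, t_draft, t_verify):
    """Tokens per unit time for one draft-and-verify cycle."""
    return expected_tokens(alpha, gamma) / (gamma * t_draft + t_verify)


def best_draft_length(alpha, t_draft, t_verify, max_gamma=16):
    """Draft length in [0, max_gamma] that maximizes modeled throughput."""
    return max(
        range(max_gamma + 1),
        key=lambda g: cycle_throughput(alpha, g, t_draft, t_verify),
    )


# With a 10x cheaper draft model, a higher acceptance rate favors longer drafts.
short = best_draft_length(alpha=0.3, t_draft=1.0, t_verify=10.0)
long = best_draft_length(alpha=0.9, t_draft=1.0, t_verify=10.0)
```

The point of the toy model is that maximizing acceptance length alone ignores the denominator; the abstract's argument is that the real time cost has to be part of the objective.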
53. 【2603.01625】Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation
Link: https://arxiv.org/abs/2603.01625
Authors: Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: radiology requires validation, Reliable deployment, surface-level text similarity, requires validation metrics, ensure clinical fidelity
Comments: This is an extended version of a manuscript currently under review
Abstract:Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
54. 【2603.01622】More Data, Fewer Diacritics: Scaling Arabic TTS
Link: https://arxiv.org/abs/2603.01622
Authors: Ahmed Musleh, Yifan Zhang, Kareem Darwish
Categories: Computation and Language (cs.CL)
Keywords: Arabic TTS training, Arabic TTS, exploring Arabic TTS, Arabic TTS model, Arabic
Comments:
Abstract:Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both public training data and accurate Arabic diacritization models. In this paper, we address this limitation by exploring Arabic TTS training on large automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours with and without diacritization. We show that though models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
55. 【2603.01580】Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
Link: https://arxiv.org/abs/2603.01580
Authors: Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty
Categories: Computation and Language (cs.CL)
Keywords: automated fact checking, mathematical problem solving, generative language models, fact checking, Reasoning traces produced
Comments:
Abstract:Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
56. 【2603.01550】Extracting Training Dialogue Data from Large Language Model based Task Bots
Link: https://arxiv.org/abs/2603.01550
Authors: Shuo Zhang, Junzhou Zhao, Junji Hou, Pinghui Wang, Chenxu Wang, Jing Tao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, complex language patterns, Large Language, Language Models, modeling complex language
Comments: Accepted for publication in IEEE Transactions on Information Forensics and Security (TIFS). © 2026 IEEE
Abstract:Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
57. 【2603.01502】Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
Link: https://arxiv.org/abs/2603.01502
Authors: Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: Large Speech-Language Models, Recent advancements, advancements in Large, Large Speech-Language, linguistic understanding
Comments:
Abstract:Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
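For readers unfamiliar with CKA, the sketch below shows the linear variant of centered kernel alignment and a cross-layer comparison grid built from it. This is an illustration only: the abstract does not state which CKA kernel the authors use, and the speech-text token alignment step is assumed to have already produced row-aligned activation matrices. Function names are hypothetical, and pure-Python lists are used to keep the example dependency-free.

```python
import math


def center_columns(m):
    """Subtract each column's mean (rows are aligned tokens, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[x - mu for x, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    x, y = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(x, y)
    den = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return num / den if den > 0 else 0.0


def cross_layer_cka(speech_layers, text_layers):
    """Grid of CKA scores: entry [i][j] compares speech layer i with text layer j."""
    return [[linear_cka(s, t) for t in text_layers] for s in speech_layers]


# Identical (or uniformly rescaled) representations score 1; uncorrelated ones score 0.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
same = linear_cka(x, [[2 * v for v in row] for row in x])
```

A "broad cross-layer alignment band", in the abstract's terms, would show up in such a grid as high scores spread across many off-diagonal cells instead of concentrating on the diagonal.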
58. 【2603.01464】ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning
Link: https://arxiv.org/abs/2603.01464
Authors: Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: require accurate reasoning, disease-related variant analysis, analysis tasks arising, Protein, clinical research
Comments:
Abstract:Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
59. 【2603.01457】Power Echoes: Investigating Moderation Biases in Online Power-Asymmetric Conflicts
Link: https://arxiv.org/abs/2603.01457
Authors: Yaqiong Li, Peng Zhang, Peixu Hou, Kainan Tu, Guangping Zhang, Shan Qu, Wenshi Chen, Yan Chen, Ning Gu, Tun Lu
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Keywords: Online power-asymmetric conflicts, Online power-asymmetric, power-asymmetric conflicts, platforms rely, moderation
Comments: Accepted at the ACM CHI conference on Human Factors in Computing Systems (ACM CHI 2026)
Abstract:Online power-asymmetric conflicts are prevalent, and most platforms rely on human moderators to conduct moderation currently. Previous studies have been continuously focusing on investigating human moderation biases in different scenarios, while moderation biases under power-asymmetric conflicts remain unexplored. Therefore, we aim to investigate the types of power-related biases human moderators exhibit in power-asymmetric conflict moderation (RQ1) and further explore the influence of AI's suggestions on these biases (RQ2). For this goal, we conducted a mixed design experiment with 50 participants by leveraging the real conflicts between consumers and merchants as a scenario. Results suggest several biases towards supporting the powerful party within these two moderation modes. AI assistance alleviates most biases of human moderation, but also amplifies a few. Based on these results, we propose several insights into future research on human moderation and human-AI collaborative moderation systems for power-asymmetric conflicts.
60. 【2603.01455】From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Link: https://arxiv.org/abs/2603.01455
Authors: Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Keywords: impressive short-term reasoning, human cognitive efficiency, large language models, demonstrated impressive short-term, long-horizon video understanding
Comments: TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding grounded in Fuzzy-Trace Theory. It features adaptive memory compression via the Information Bottleneck and employs an entropy-driven top-down retrieval to access fine-grained details only when necessary. 16 pages, 7 figures, 7 tables
Abstract:While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at this https URL.
61. 【2603.01438】Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
Link: https://arxiv.org/abs/2603.01438
Authors: Yuxin Liu, Mingye Zhu, Siyuan Liu, Bo Hu, Lei Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Role-Playing Language Agents, Language Models, Large Language, Language Agents
Comments: ICLR 2026
Abstract:The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenarios. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
62. 【2603.01426】Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
Link: https://arxiv.org/abs/2603.01426
Authors: Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty
Categories: Computation and Language (cs.CL)
Keywords: dominant memory bottleneck, methods claiming 80-90, minimal benchmark degradation, recent methods claiming, memory bottleneck
Comments:
Abstract:As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
63. 【2603.01425】LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Link: https://arxiv.org/abs/2603.01425
Authors: Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: fundamentally transformed dense, generative architectures, transformed dense retrieval, fundamentally transformed, discriminative encoders
Comments: Under Review
Abstract:LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
64. 【2603.01423】Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
Link: https://arxiv.org/abs/2603.01423
Authors: Jiyoon Myung
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, mixed-topic conversations, prior context
Comments: Accepted at the Workshop on Assessing and Improving Reliability of Foundation Models in the Real World (AAAI 2026)
Abstract:Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.
65. 【2603.01421】SciDER: Scientific Data-centric End-to-end Researcher
Link: https://arxiv.org/abs/2603.01421
Authors: Ke Lin, Yilin Lu, Shreyas Bhat, Xuehang Guo, Junier Oliva, Qingyun Wang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Automated scientific discovery, autonomously process raw, large language models, existing agents struggle, process raw data
Comments: 10 pages, 6 figures, 3 tables
Abstract:Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. Distributed as a modular Python package, we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research and aim to be accessible to all researchers and developers.
66. 【2603.01385】Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning
Link: https://arxiv.org/abs/2603.01385
Authors: Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, graph foundation model, generalizes diverse scenarios, foundation model, motivated researchers
Comments: accepted by WWW 2026
Abstract:The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instructions tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieve only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.
67. 【2603.01382】End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation
Link: https://arxiv.org/abs/2603.01382
Authors: Minghui Wu, Haitao Tang, Jiahuan Fan, Ruizhi Liao, Yanyong Zhang
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Keywords: Dysarthric speech reconstruction, automatic speech recognition, convert dysarthric speech, combines automatic speech, typically employs
Comments: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Abstract:Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: this https URL
68. 【2603.01369】DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement
Link: https://arxiv.org/abs/2603.01369
Authors: Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang, Yue Zhang
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Keywords: significant speaker variability, presenting persistent challenges, exhibits abnormal prosody, speech exhibits abnormal, speaker variability
Comments: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Abstract:Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
69. 【2603.01366】NM-DEKL$^3_\infty$: A Three-Layer Non-Monotone Evolving Dependent Type Logic
Link: https://arxiv.org/abs/2603.01366
Authors: Peng Chen
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
Keywords: Dependent Knowledge-Enhanced Logic, Non-Monotone Dependent Knowledge-Enhanced, dependent type system, formalising evolving knowledge, Knowledge-Enhanced Logic
Comments:
Abstract:We present a new dependent type system, NM-DEKL$^3_\infty$ (Non-Monotone Dependent Knowledge-Enhanced Logic), for formalising evolving knowledge in dynamic environments. The system uses a three-layer architecture separating a computational layer, a constructive knowledge layer, and a propositional knowledge layer. We define its syntax and semantics and establish Soundness and Equational Completeness; we construct a syntactic model and prove that it is initial in the category of models, from which equational completeness follows. We also give an embedding into the $\mu$-calculus and a strict expressiveness inclusion (including the expressibility of non-bisimulation-invariant properties).
70. 【2603.01353】Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain
Link: https://arxiv.org/abs/2603.01353
Authors: Yuma Okochi, Fabio Milentiansen Sim, Tomoyasu Okada
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: reasoning ability remains, urgent challenge, adapting LLMs, LLMs to specific, ability remains
Comments: 8 pages, 2 figures. Japanese version published in NLP2026
Abstract:In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets on this https URL .
71. 【2603.01343】PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
Link: https://arxiv.org/abs/2603.01343
Authors: Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: poorly reflects real-world, Large language models, Large language, multiple-choice accuracy poorly, accuracy poorly reflects
Comments:
Abstract:Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
72. 【2603.01331】MetaState: Persistent Working Memory for Discrete Diffusion Language Models
Link: https://arxiv.org/abs/2603.01331
Authors: Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, Wenke Lee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: generate text, text by iteratively, diffusion language models, language models
Comments:
Abstract:Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with $K$-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
73. 【2603.01327】SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution
Link: https://arxiv.org/abs/2603.01327
Authors: Kang He, Kaushik Roy
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, exhibit strong performance, self-contained programming tasks, Large language, language models
Comments:
Abstract:Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context management for accurate localization, and (2) systematic approaches for iterative, test-driven code modification to resolve issues. To address these challenges, we propose SWE-Adept, an LLM-based two-agent framework where a localization agent identifies issue-relevant code locations and a resolution agent implements the corresponding fixes. For issue localization, we introduce agent-directed depth-first search that selectively traverses code dependencies. This minimizes issue-irrelevant content in the agent's context window and improves localization accuracy. For issue resolution, we employ adaptive planning and structured problem solving. We equip the agent with specialized tools for progress tracking and Git-based version control. These tools interface with a shared working memory that stores code-state checkpoints indexed by execution steps, facilitating precise checkpoint retrieval. This design enables reliable agent-driven version-control operations for systematic issue resolution, including branching to explore alternative solutions and reverting failed edits. Experiments on SWE-Bench Lite and SWE-Bench Pro demonstrate that SWE-Adept consistently outperforms prior approaches in both issue localization and resolution, improving the end-to-end resolve rate by up to 4.7%.
74. 【2603.01326】Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning
Link: https://arxiv.org/abs/2603.01326
Authors: Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, Javen Shi
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, typically treat hidden, Existing explainability methods, treat hidden states
Comments:
Abstract:Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
75. 【2603.01311】Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
Link: https://arxiv.org/abs/2603.01311
Authors: Achuth Chandrasekhar,Janghoon Ock,Amir Barati Farimani
Categories: Computation and Language (cs.CL)
Keywords: twenty-first century, major challenge, Model Context Protocol, reduction reaction, first-principles approaches based
Comments:
Click to view abstract
Abstract:The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.
76. 【2603.01297】I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Link: https://arxiv.org/abs/2603.01297
Authors: Subramanyam Sahoo,Vinija Jain,Divya Chaudhary,Aman Chadha
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Instruction tuned reasoning, assuming representation stability, Instruction tuned, tuned reasoning models, assuming representation
Comments: Accepted at the ICBINB: Where LLMs Need to Improve workshop at ICLR 2026. 12 pages and 3 Figures
Click to view abstract
Abstract:Instruction tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20$\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
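The correspondence between $\sigma=0.02$ and roughly one degree of angular drift can be checked with a small calculation. The sketch below assumes a unit-norm embedding and a perturbation of norm $\sigma$ orthogonal to it, so the drift angle is $\arctan(\sigma)$; this geometric reading is an assumption, since the abstract does not spell out how the perturbation is constructed.

```python
import math

def angular_drift_degrees(sigma):
    """Angle between a unit vector x and x + d, where d is orthogonal
    to x and has norm sigma (assumed perturbation model)."""
    return math.degrees(math.atan(sigma))

def angle_between_degrees(u, v):
    """Angle between two vectors, computed from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding
    return math.degrees(math.acos(cos))

# Unit embedding along the first axis, perturbed orthogonally by 0.02.
print(angular_drift_degrees(0.02))                     # about 1.15 degrees
print(angle_between_degrees([1.0, 0.0], [1.0, 0.02]))  # same angle
```

Under this reading, a perturbation of 0.02 moves the embedding by a little over one degree, consistent with the figure quoted in the abstract.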
77. 【2603.01291】JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks
Link: https://arxiv.org/abs/2603.01291
Authors: Masahiro Kaneko,Ayana Niwa,Timothy Baldwin
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: undermines societal trust, extreme cases threatens, cases threatens human, threatens human lives, undermines societal
Comments: ICLR 2026
Click to view abstract
Abstract:Fake news undermines societal trust and decision-making across politics, economics, health, and international relations, and in extreme cases threatens human lives and societal safety. Because fake news reflects region-specific political, social, and cultural contexts and is expressed in language, evaluating the risks of large language models (LLMs) requires a multi-lingual and regional perspective. Malicious users can bypass safeguards through jailbreak attacks, inducing LLMs to generate fake news. However, no benchmark currently exists to systematically assess attack resilience across languages and regions. Here, we propose JailNewsBench, the first benchmark for evaluating LLM robustness against jailbreak-induced fake news generation. JailNewsBench spans 34 regions and 22 languages, covering 8 evaluation sub-metrics through LLM-as-a-Judge and 5 jailbreak attacks, with approximately 300k instances. Our evaluation of 9 LLMs reveals that the maximum attack success rate (ASR) reached 86.3% and the maximum harmfulness score was 3.5 out of 5. Notably, for English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions, highlighting substantial imbalances in safety across languages and regions. In addition, our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias. Our dataset and code are available at this https URL.
78. 【2603.01289】Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data
Link: https://arxiv.org/abs/2603.01289
Authors: Minghao Guo,Ziyi Ye,Wujiang Xu,Xi Zhu,Wenyue Hua,Dimitris N. Metaxas
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, remarkable human-like capabilities, demonstrated remarkable human-like, individual remains under-explored, specific individual remains
Comments: 5 pages, 2 figures
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches including: fine-tuning, retrieval-augmented generation (RAG), memory-based approach, and hybrid methods that integrate fine-tuning and RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.
79. 【2603.01288】Efficient Extractive Summarization with MAMBA-Transformer Hybrids for Low-Resource Scenarios
Link: https://arxiv.org/abs/2603.01288
Authors: Nisrine Ait Khayi
Categories: Computation and Language (cs.CL)
Keywords: quadratic complexity, resource-constrained settings, bottlenecked by quadratic, limiting deployment, deployment in resource-constrained
Comments:
Click to view abstract
Abstract:Extractive summarization of long documents is bottlenecked by quadratic complexity, often forcing truncation and limiting deployment in resource-constrained settings. We introduce the first Mamba-Transformer hybrid for extractive summarization, combining the semantic strength of pre-trained transformers with the linear-time processing of state space models. Leveraging Mamba's ability to process full documents without truncation, our approach preserves context while maintaining strong summarization quality. The architecture includes: (1) a transformer encoder for sentence-level semantics, (2) a Mamba state space model to capture inter-sentence dependencies efficiently, and (3) a linear classifier for sentence relevance prediction. Across news, argumentative, and scientific domains under low-resource conditions, our method achieves: (1) large gains over BERTSUM and MATCHSUM, including +0.23 ROUGE-1 on ArXiv and statistically significant improvements on all datasets (p < 0.001); (2) consistent advantages across domains, strongest on the longest documents; (3) robust performance with limited training data; and (4) 24-27% faster inference on news summarization (CNN/DailyMail). We introduce the first hybrid Transformer-state space architecture for summarization, showing significant ROUGE improvements in low-resource scenarios.
80. 【2603.01285】Attention Smoothing Is All You Need For Unlearning
Link: https://arxiv.org/abs/2603.01285
Authors: Saleh Zare Zade,Xiangyu Zhou,Sijia Liu,Dongxiao Zhu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, posing significant privacy, Language Models, memorizing sensitive
Comments: Accepted by ICLR 2026
Click to view abstract
Abstract:Large Language Models are prone to memorizing sensitive, copyrighted, or hazardous content, posing significant privacy and legal concerns. Retraining from scratch is computationally infeasible, whereas current unlearning methods exhibit unstable trade-offs between forgetting and utility, frequently producing incoherent outputs on forget prompts and failing to generalize due to the persistence of lexical-level and semantic-level associations in attention. We propose Attention Smoothing Unlearning (ASU), a principled framework that casts unlearning as self-distillation from a forget-teacher derived from the model's own attention. By increasing the softmax temperature, ASU flattens attention distributions and directly suppresses the lexical-level and semantic-level associations responsible for reconstructing memorized knowledge. This results in a bounded optimization objective that erases factual information yet maintains coherence in responses to forget prompts. Empirical evaluation on TOFU, MUSE, and WMDP, along with real-world and continual unlearning scenarios across question answering and text completion, demonstrates that ASU outperforms the baselines for most unlearning scenarios, delivering robust unlearning with minimal loss of model utility.
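The flattening effect of a higher softmax temperature, which the abstract names as the core mechanism, can be seen in a few lines of Python. This is only a sketch of temperature-scaled softmax on a toy score vector; the scores and the temperature of 4.0 are illustrative assumptions, and the self-distillation objective of ASU is not reproduced here.

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter output."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; larger values mean a flatter distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy attention scores for one query over four keys (illustrative values).
scores = [4.0, 1.0, 0.5, 0.0]
sharp = softmax(scores, temperature=1.0)
smooth = softmax(scores, temperature=4.0)

print(max(sharp), entropy(sharp))
print(max(smooth), entropy(smooth))
```

With the higher temperature, the largest attention weight shrinks and the entropy rises, so no single key dominates the output.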
81. 【2603.01281】Spectral Attention Steering for Prompt Highlighting
Link: https://arxiv.org/abs/2603.01281
Authors: Weixian Waylon Li,Yuchen Niu,Yongxin Yang,Keshuang Li,Tiejun Ma,Shay B. Cohen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: controlling model focus, prioritises user-specified text, model prioritises user-specified, model focus, enabling capabilities
Comments: Accepted to ICLR 2026 (Poster, Top 4%)
Click to view abstract
Abstract:Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt's semantic intent. Our experiments show both methods significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead, in compatibility with optimised attention.
82. 【2603.01266】A Study on Building Efficient Zero-Shot Relation Extraction Models
Link: https://arxiv.org/abs/2603.01266
Authors: Hugo Thomas,Caio Corro,Guillaume Gravier,Pascale Sébillot
Categories: Computation and Language (cs.CL)
Keywords: Zero-shot relation extraction, previously unseen, relation extraction aims, aims to identify, textual descriptions
Comments: LREC 2026
Click to view abstract
Abstract:Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel types (i.e., previously unseen) instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single pass models and models with a rejection mechanism. We adapt several state-of-the-art tools, and compare them in this challenging setting, showing that no existing work is really robust to realistic assumptions, but overall AlignRE (Li et al., 2024) performs best along all criteria.
83. 【2603.01254】LLM Self-Explanations Fail Semantic Invariance
Link: https://arxiv.org/abs/2603.01254
Authors: Stefan Szeider
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLM self-explanations, semantic invariance testing, present semantic invariance, LLM, invariance testing
Comments:
Click to view abstract
Abstract:We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
84. 【2603.01252】Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation
Link: https://arxiv.org/abs/2603.01252
Authors: Liwen Sun,Xiang Yu,Ming Tan,Zhuohao Chen,Anqi Cheng,Ashutosh Joshi,Chenyan Xiong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: requiring intensive interactions, Clinical diagnosis, diagnosis is time-consuming, requiring intensive, intensive interactions
Comments: Short paper published in the Findings of EACL 2026
Click to view abstract
Abstract:Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce a Knowledge Graph-augmented LLM with active in-context learning to generate relevant and important follow-up questions, KG-Followup, serving as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks in recall.
85. 【2603.01243】Suffix-Constrained Greedy Search Algorithms for Causal Language Models
Link: https://arxiv.org/abs/2603.01243
Authors: Ayoub Hammal,Pierre Zweigenbaum,Caio Corro
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, interfaces and chatbots, powerful tools, found applications
Comments:
Click to view abstract
Abstract:Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces motivated their use in many prediction tasks like math question answering. Unfortunately, extracting the final answer in an LLM free-form output is difficult, as it is an information extraction problem on its own. In this work, we introduce suffix-constrained generation, that aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms that are based on greedy search procedures. We experiment on several datasets, and show that our approach allows to guarantee trivial deterministic extraction of the final answer from an LLM output without having a negative impact on results, and even improving them.
86. 【2603.01239】Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
Link: https://arxiv.org/abs/2603.01239
Authors: Harshavardhan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, Claude Sonnet, hypothesized tendency, tendency for large, large language
Comments:
Click to view abstract
Abstract:We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models -- Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 -- across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, eta^2 = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini's calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 -- indicating that SACD can manifest as suppression of natural calibration improvement rather than ac
87. 【2603.01225】Can Thinking Models Think to Detect Hateful Memes?
Link: https://arxiv.org/abs/2603.01225
Authors: Mohamed Bayan Kmainasi,Mucahid Kutlu,Ali Ezzat Shahroor,Abul Hasnat,Firoj Alam
Categories: Computation and Language (cs.CL)
Keywords: conveys harmful intent, interaction conveys harmful, require compositional multimodal, Relative Policy Optimization, hateful meme
Comments:
Click to view abstract
Abstract:Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
88. 【2603.01223】Learn Hard Problems During RL with Reference Guided Fine-tuning
Link: https://arxiv.org/abs/2603.01223
Authors: Yangzhen Wu,Shanda Li,Zixin Wen,Xin Zhou,Ameet Talwalkar,Yiming Yang,Wenhao Huang,Tianle Cai
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: LLM fails, meaningful positive feedback, Reinforcement learning, receiving meaningful positive, fails to sample
Comments: 16 pages, 5 figures
Click to view abstract
Abstract:Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
89. 【2603.01220】Generative AI Fictionality: How Novels Power Large Language Models
Link: https://arxiv.org/abs/2603.01220
Authors: Edwin Roland,Richard Jean So
Categories: Computation and Language (cs.CL)
Keywords: simply next-word predictors, training data, fiction, models, Generative
Comments:
Click to view abstract
Abstract:Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels' effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's various modes of cultural production must account for a relatively novel dimension: computational training data.
90. 【2603.01214】Reasoning Boosts Opinion Alignment in LLMs
Link: https://arxiv.org/abs/2603.01214
Authors: Frédéric Berdoz,Yann Billeter,Yann Vonlanthen,Roger Wattenhofer
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: group political preferences, popular policies, aims to capture, capture individual, individual or group
Comments: Accepted at ICLR 2026
Click to view abstract
Abstract:Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
91. 【2603.01212】XAI-enhanced Comparative Opinion Mining via Aspect-based Scoring and Semantic Reasoning
Link: https://arxiv.org/abs/2603.01212
Authors: Ngoc-Quang Le,T. Thanh-Lam Nguyen,Quoc-Trung Phu,Thi-Phuong Le,Duy-Cat Can,Hoang-Quynh Le
Categories: Computation and Language (cs.CL)
Keywords: involves comparing products, Comparative opinion mining, Comparative opinion, mining involves comparing, involves comparing
Comments:
Click to view abstract
Abstract:Comparative opinion mining involves comparing products from different reviews. However, transformer-based models designed for this task often lack transparency, which can adversely hinder the development of trust in users. In this paper, we propose XCom, an enhanced transformer-based model separated into two principal modules, i.e., (i) aspect-based rating prediction and (ii) semantic analysis for comparative opinion mining. XCom also incorporates a Shapley additive explanations module to provide interpretable insights into the model's deliberative decisions. Empirically, XCom achieves leading performances compared to other baselines, which demonstrates its effectiveness in providing meaningful explanations, making it a more reliable tool for comparative opinion mining. Source code is available at: this https URL.
92. 【2603.01211】A Unified Framework to Quantify Cultural Intelligence of AI
Link: https://arxiv.org/abs/2603.01211
Authors: Sunipa Dev,Vinodkumar Prabhakaran,Rutledge Chin Feman,Aida Davani,Remi Denton,Charu Kalia,Piyawat L Kumjorn,Madhurima Maji,Rida Qadri,Negar Rostamzadeh,Renee Shelby,Romina Stella,Hayk Stepanyan,Erin van Liemt,Aishwarya Verma,Oscar Wahltinez,Edem Wornyo,Andrew Zaldivar,Saška Mojsilović
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: generative AI technologies, technologies are increasingly, increasingly being launched, contexts is exigently, cultural
Comments:
Click to view abstract
Abstract:As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is exigently becoming a priority. While recent years have seen numerous and much-needed efforts on cultural benchmarking, these efforts have largely focused on specific aspects of culture and evaluation. While these efforts contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
93. 【2603.01190】Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification
Link: https://arxiv.org/abs/2603.01190
Authors: Jacob Devasier
Categories: Computation and Language (cs.CL)
Keywords: Masked Diffusion Language, Unlike autoregressive models, sequence positions simultaneously, handle tasks requiring, tasks requiring justified
Comments:
Click to view abstract
Abstract:Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.
94. 【2603.01185】Token-level Data Selection for Safe LLM Fine-tuning
Link: https://arxiv.org/abs/2603.01185
Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: Fine-tuning large language, large language models, domains and applications, large language, custom datasets
Comments: Accepted by ICLR 2026
Click to view abstract
Abstract:Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at this https URL.
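The token-level scoring idea in this abstract can be illustrated with a minimal sketch. It assumes per-token losses from the two models are already available as plain lists, and it assumes that a token is riskier when the safety-degraded model predicts it much better than the utility-oriented model (risk = utility loss minus degraded loss). The sign convention, the `drop_fraction` parameter, and both function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def token_risk_scores(loss_utility, loss_degraded):
    """Per-token risk as a loss difference (assumed sign: higher = riskier)."""
    if len(loss_utility) != len(loss_degraded):
        raise ValueError("both loss lists must cover the same tokens")
    return [u - d for u, d in zip(loss_utility, loss_degraded)]


def keep_mask(scores, drop_fraction=0.2):
    """Return a 0/1 mask that drops the highest-risk fraction of tokens."""
    n_drop = int(len(scores) * drop_fraction)
    # Rank token indices from highest to lowest risk score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [0 if i in dropped else 1 for i in range(len(scores))]


if __name__ == "__main__":
    # Hypothetical per-token losses for a five-token training sample.
    utility = [2.0, 0.5, 3.0, 1.0, 0.5]
    degraded = [0.5, 0.5, 0.5, 1.0, 1.0]
    scores = token_risk_scores(utility, degraded)
    print(scores)
    print(keep_mask(scores, drop_fraction=0.4))
```

In a fine-tuning loop, a mask like this would typically multiply the per-token loss so that dropped tokens contribute no gradient while the rest of the sample is still used.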
95. 【2603.01167】DEP: A Decentralized Large Language Model Evaluation Protocol
Link: https://arxiv.org/abs/2603.01167
Authors: Jianxiang Peng, Junhao Li, Hongxiang Wang, Haocheng Lyu, Hui Guo, Siyi Hao, Zhen Wang, Chuang Liu, Shaowei Zhang, Bojian Xiong, Yue Chen, Zhuowen Han, Ling Shi, Tianyu Dong, Juesi Xiao, Lei Yang, Yuqi Ren, Deyi Xiong
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, large number, Large
Comments:
Click to view abstract
Abstract:With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making it hard to ensure the consistency and reproducibility of results. Furthermore, mainstream evaluation frameworks are centralized, with datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose a Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In the remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using the DEP Toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.
96. 【2603.01160】Semantic XPath: Structured Agentic Memory Access for Conversational AI
Link: https://arxiv.org/abs/2603.01160
Authors: Yifan Simon Liu, Ruifan Wu, Liam Gallagher, Jiazhou Liang, Armin Toroghi, Scott Sanner
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: agents increasingly maintain, increasingly maintain structured, agents increasingly, increasingly maintain, memory
Comments:
Click to view abstract
Abstract:Conversational AI (ConvAI) agents increasingly maintain structured memory to support long-term, task-oriented interactions. In-context memory approaches append the growing history to the model input, which scales poorly under context-window limits. RAG-based methods retrieve request-relevant information, but most assume flat memory collections and ignore structure. We propose Semantic XPath, a tree-structured memory module to access and update structured conversational memory. Semantic XPath improves performance over flat-RAG baselines by 176.7% while using only 9.1% of the tokens required by in-context memory. We also introduce SemanticXPath Chat, an end-to-end ConvAI demo system that visualizes the structured memory and query execution details. Overall, this paper demonstrates a candidate for the next generation of long-term, task-oriented ConvAI systems built on structured memory.
97. 【2603.01096】Unified Vision-Language Modeling via Concept Space Alignment
Link: https://arxiv.org/abs/2603.01096
Authors: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Omnilingual Embeddings Team, embedding space extended, Omnilingual Embeddings, embedding space SONAR, V-SONAR
Comments: ICLR 2026
Click to view abstract
Abstract:We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.
98. 【2603.01089】CARD: Towards Conditional Design of Multi-agent Topological Structures
Link: https://arxiv.org/abs/2603.01089
Authors: Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language model, shown strong capabilities, Large language, based multi-agent systems, collaborative reasoning
Comments: Accepted to ICLR 2026
Click to view abstract
Abstract:Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: this https URL.
99. 【2603.01070】How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Link: https://arxiv.org/abs/2603.01070
Authors: Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei
Categories: Computation and Language (cs.CL)
Keywords: performing logical deductions, Solving complex geometric, problems inherently requires, Solving complex, Multimodal Large Language
Comments:
Click to view abstract
Abstract:Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
100. 【2603.01059】GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
Link: https://arxiv.org/abs/2603.01059
Authors: Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
Categories: Computation and Language (cs.CL)
Keywords: increasingly capable chatbots, enabled increasingly capable, Recent advances, capable chatbots, enabled increasingly
Comments: Work in progress
Click to view abstract
Abstract:Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistant. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: this https URL .
101. 【2603.01042】Thoth: Mid-Training Bridges LLMs to Time Series Understanding
Link: https://arxiv.org/abs/2603.01042
Authors: Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, demonstrated remarkable success, Large Language, time series, demonstrated remarkable
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: this https URL.
102. 【2603.01009】Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays
Link: https://arxiv.org/abs/2603.01009
Authors: Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed
Categories: Computation and Language (cs.CL)
Keywords: gained increasing attention, Automated Essay Scoring, Arabic AES, Automated Essay, Arabic AES remains
Comments:
Click to view abstract
Abstract:Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
103. 【2603.00963】Stabilizing Policy Optimization via Logits Convexity
Link: https://arxiv.org/abs/2603.00963
Authors: Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, reinforcement learning, notoriously unstable, supervised fine-tuning, recent success
Comments:
Click to view abstract
Abstract:While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
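The convexity property mentioned in this abstract can be checked directly for the standard softmax cross-entropy loss, which is assumed here to be the SFT loss in question. The notation (logits $z$, target index $y$, probabilities $p$, one-hot vector $e_y$) is introduced for this note and is not taken from the paper.

```latex
\ell(z) = -z_y + \log \sum_{j} e^{z_j}, \qquad p = \mathrm{softmax}(z)

\nabla_z \ell = p - e_y, \qquad \nabla_z^2 \ell = \mathrm{diag}(p) - p p^\top

v^\top \nabla_z^2 \ell \, v
  = \sum_j p_j v_j^2 - \Big( \sum_j p_j v_j \Big)^2
  = \mathrm{Var}_{j \sim p}[v_j] \;\ge\; 0 \quad \text{for all } v
```

Since the Hessian is positive semidefinite for every $z$, the loss is convex in the logits. This says nothing about convexity in the model parameters, which the abstract does not claim.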
104. 【2603.00958】S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature
Link: https://arxiv.org/abs/2603.00958
Authors: Abigail Berthe-Pardo(1), Gaspard Michel(1 and 2), Elena V. Epure(2 and 3), Christophe Cerisara(1) ((1) LORIA, Vandœuvre-lès-Nancy, France, (2) Deezer Research, Paris, France, (3) Idiap Research Institute, Switzerland)
Categories: Computation and Language (cs.CL)
Keywords: reaching unprecedented levels, synthetic audiobook narration, increased interest, reaching unprecedented, levels of naturalness
Comments: Accepted to LREC 2026
Click to view abstract
Abstract:With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems' ability to impersonate fictional characters, and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional character in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character's age, gender, origin or physical health, which in turn requires dedicated benchmark datasets to evaluate extraction systems' performance. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL comprises 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Model embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles on others such as Origin or Physical Health. The dataset and evaluation code are available at this https URL .
105. 【2603.00941】Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages
Link: https://arxiv.org/abs/2603.00941
Authors: Kaushal Santosh Bhogale, Tahir Javed, Greeshma Susan John, Dhruv Rathi, Akshayasree Padmanaban, Niharika Parasa, Mitesh M. Khapra
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Evaluating ASR systems, Evaluating ASR, suffix splitting flexibility, ASR systems, Word Error Rate
Comments: Accepted in ICASSP 2026
Click to view abstract
Abstract:Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
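To make the idea of a variant-tolerant error rate concrete, here is a minimal sketch of word-level WER in which a hypothesis word that matches any permitted spelling of the reference word costs nothing. The abstract does not define OIWER, so this is only an illustration of the general principle: the `variants` dictionary, the function name, and the English example are hypothetical, and the sketch handles one-to-one word variants only, not suffix splitting.

```python
def wer(reference, hypothesis, variants=None):
    """Word error rate; allowed spelling variants of a reference word cost 0."""
    variants = variants or {}
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        allowed = {r} | set(variants.get(r, ()))
        for j, h in enumerate(hyp, start=1):
            sub = 0 if h in allowed else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the colour is red"
    hyp = "the color is red"
    print(wer(ref, hyp))                                  # plain WER
    print(wer(ref, hyp, variants={"colour": ["color"]}))  # variant-tolerant
```

With the variant list supplied, the spelling difference no longer counts as an error, which is the kind of reduction in pessimistic error rates the abstract describes.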
106. 【2603.00925】The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Link: https://arxiv.org/abs/2603.00925
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: Effective mathematics education, Effective mathematics, identifying and responding, Effective, students' mistakes
Comments: 15 pages, 10 figures
Click to view abstract
Abstract:Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
107. 【2603.00924】Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Link: https://arxiv.org/abs/2603.00924
Authors: Manil Shrestha, Edward Kim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, confidence scores, Large
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($\tau \approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($\tau$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9-13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
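For readers unfamiliar with how a conformal threshold is calibrated, the following is a minimal sketch of the textbook split-conformal recipe. It assumes a held-out calibration set of extractions known to be correct, each with a model confidence, and uses `1 - confidence` as the nonconformity score. The function names, the choice of score, and the example numbers are assumptions for illustration; the paper's own calibration procedure may differ.

```python
import math


def conformal_threshold(calib_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calib_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Split-conformal rank: ceil((n + 1) * (1 - alpha)), counted from 1.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: accept everything.
        return float("inf")
    return sorted(calib_scores)[rank - 1]


def accept(confidence, q_hat):
    """Keep an extraction if its nonconformity score is within the threshold."""
    return (1.0 - confidence) <= q_hat


if __name__ == "__main__":
    # Hypothetical nonconformity scores (1 - confidence) of correct extractions.
    calib = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
    q_hat = conformal_threshold(calib, alpha=0.5)
    print(q_hat, accept(0.7, q_hat), accept(0.5, q_hat))
```

Under exchangeability of calibration and test items, a new correct extraction is accepted with probability at least `1 - alpha`. In this sketch the equivalent confidence cutoff is `1 - q_hat`, so calibrating separately per domain naturally produces different cutoffs for each domain.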
108. 【2603.00923】Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Link: https://arxiv.org/abs/2603.00923
Authors: Siyu Liang, Talant Mawkanuli, Gina-Anne Levow
Categories: Computation and Language (cs.CL)
Keywords: Interlinear glossed text, Interlinear glossed, glossed text, creation remains, remains a major
Comments:
Click to view abstract
Abstract:Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
109. 【2603.00917】Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Link: https://arxiv.org/abs/2603.00917
Authors: Shravani Hariprasad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: remains poorly understood, phrasings remains poorly, Small open-source language, Small open-source, prompt phrasings remains
Comments: 30 pages, 7 figures, 2 tables
Click to view abstract
Abstract:Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
110. 【2603.00907】KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Link: https://arxiv.org/abs/2603.00907
Authors: Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang, Huazhong Liu, Yunshan Zhong
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, cache significantly limit, Large Language, ability of Large, cache significantly
Comments:
Click to view abstract
Abstract:The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
111. 【2603.00889】CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Link: https://arxiv.org/abs/2603.00889
Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, recently exhibited remarkable, high-quality reasoning data, exhibited remarkable reasoning
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
112. 【2603.00842】MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
Link: https://arxiv.org/abs/2603.00842
Authors: Kai Zhang,Zhengqing Yuan,Cheng Peng,Songlin Zhao,Mengxian Lyu,Ziyi Chen,Yanfang Ye,Wei Liu,Ying Zhang,Kaleb E Smith,Lifang He,Lichao Sun,Yonghui Wu
Subjects: Computation and Language (cs.CL)
Keywords: on-premises deployment required, Biomedical multimodal assistants, deployment gap remains, PHI compliance, critical deployment gap
Comments: Technical report, work in progress
Abstract:Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It successfully outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
113. 【2603.00840】Learning Nested Named Entity Recognition from Flat Annotations
Link: https://arxiv.org/abs/2603.00840
Authors: Igor Rozhkov,Natalia Loukachevitch
Subjects: Computation and Language (cs.CL)
Keywords: requires expensive multi-level, expensive multi-level annotation, recognition identifies entities, identifies entities contained, named entity recognition
Comments: Accepted at EACL 2026, 15 pages, 2 figures, 8 tables
Abstract:Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at this https URL.
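To make the "string inclusions (substring matching)" idea in the abstract above concrete, here is a minimal sketch. It is one reading of the phrase, not the authors' implementation: the function name `string_inclusions`, the `(text, label)` input format, and the example entities are all assumptions for illustration. It proposes an inner entity whenever one flat entity string occurs inside another.

```python
def string_inclusions(flat_entities):
    """Propose nested entities by plain substring matching (hypothetical helper).

    flat_entities: list of (text, label) pairs taken from flat annotations.
    Returns tuples (outer_text, inner_text, inner_label, start, end), where
    start/end are character offsets of the inner entity inside the outer one.
    """
    nested = []
    for outer_text, _outer_label in flat_entities:
        for inner_text, inner_label in flat_entities:
            # Skip empty strings and the entity itself.
            if not inner_text or inner_text == outer_text:
                continue
            start = outer_text.find(inner_text)
            while start != -1:
                nested.append(
                    (outer_text, inner_text, inner_label, start, start + len(inner_text))
                )
                start = outer_text.find(inner_text, start + 1)
    return nested


# Illustrative input, not taken from NEREL.
flat = [("Moscow State University", "ORGANIZATION"), ("Moscow", "CITY")]
pseudo_nested = string_inclusions(flat)
```

This sketch does no word-boundary or tokenization checks, so a real pipeline would likely need to filter spurious matches.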
114. 【2603.00829】Constitutional Black-Box Monitoring for Scheming in LLM Agents
Link: https://arxiv.org/abs/2603.00829
Authors: Simon Storf,Rich Barton-Cooper,James Peters-Gill,Marius Hobbhahn
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Model, reliable oversight mechanisms, requires reliable oversight, Safe deployment, deployment of Large
Comments:
Abstract:Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
115. 【2603.00824】A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations
Link: https://arxiv.org/abs/2603.00824
Authors: Hossein Javidnia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Keywords: discrete gauge-theoretic framework, large language models, local semantic charts, develop a discrete, discrete gauge-theoretic
Comments: 16 pages, 4 figures
Abstract:We develop a discrete gauge-theoretic framework for superposition in large language models (LLMs) that replaces the single-global-dictionary premise with a sheaf-theoretic atlas of local semantic charts. Contexts are clustered into a stratified context complex; each chart carries a local feature space and a local information-geometric metric (Fisher/Gauss-Newton) identifying predictively consequential feature interactions. This yields a Fisher-weighted interference energy and three measurable obstructions to global interpretability: (O1) local jamming (active load exceeds Fisher bandwidth), (O2) proxy shearing (mismatch between geometric transport and a fixed correspondence proxy), and (O3) nontrivial holonomy (path-dependent transport around loops). We prove and instantiate four results on a frozen open LLM (Llama 3.2 3B Instruct) using WikiText-103, a C4-derived English web-text subset, and the-stack-smol. (A) After constructive gauge fixing on a spanning tree, each chord residual equals the holonomy of its fundamental cycle, making holonomy computable and gauge-invariant. (B) Shearing lower-bounds a data-dependent transfer mismatch energy, turning $D_{\mathrm{shear}}$ into an unavoidable failure bound. (C) We obtain non-vacuous certified jamming/interference bounds with high coverage and zero violations across seeds/hyperparameters. (D) Bootstrap and sample-size experiments show stable estimation of $D_{\mathrm{shear}}$ and $D_{\mathrm{hol}}$, with improved concentration on well-conditioned subsystems.
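Result (A) in the abstract above can be illustrated with a toy abelian case. The sketch below is not the paper's construction: it assumes each edge carries a single rotation angle (a U(1) transport) on a connected graph without parallel edges, and the names `gauge_fix`, `edges`, and `residuals` are hypothetical. It fixes the gauge along a breadth-first spanning tree so that tree edges become trivial, and the residual left on each non-tree edge (chord) is then the total angle accumulated around that chord's fundamental cycle.

```python
from collections import deque


def gauge_fix(num_nodes, edges, root=0):
    """Toy U(1) gauge fixing on a spanning tree (illustrative assumption).

    edges: dict {(u, v): angle}, transport u -> v; v -> u uses -angle.
    Returns (gauge, residuals): gauge maps node -> accumulated angle from
    the root; residuals maps each chord (u, v) to its transformed angle.
    """
    adjacency = {node: [] for node in range(num_nodes)}
    for (u, v), angle in edges.items():
        adjacency[u].append((v, angle))
        adjacency[v].append((u, -angle))

    gauge = {root: 0.0}
    tree_edges = set()
    queue = deque([root])
    while queue:  # breadth-first search builds the spanning tree
        u = queue.popleft()
        for v, angle in adjacency[u]:
            if v not in gauge:
                gauge[v] = gauge[u] + angle  # makes tree edge u -> v trivial
                tree_edges.add(frozenset((u, v)))
                queue.append(v)

    residuals = {}
    for (u, v), angle in edges.items():
        if frozenset((u, v)) not in tree_edges:
            # Transformed transport on a chord: angle + g(u) - g(v).
            residuals[(u, v)] = angle + gauge[u] - gauge[v]
    return gauge, residuals


# Triangle 0 -> 1 -> 2 -> 0 with made-up angles; the loop sum is 1.2.
edges = {(0, 1): 0.3, (1, 2): 0.5, (2, 0): 0.4}
gauge, residuals = gauge_fix(3, edges)
```

In this example the tree contains the edges between nodes 0 and 1 and between nodes 0 and 2, so the only chord is (1, 2), and its residual matches the sum of angles around the triangle.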
116. 【2603.00823】A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction
Link: https://arxiv.org/abs/2603.00823
Authors: Ruihao Pan,Suhang Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Machine unlearning aims, large language models, specific training data, due to safety, Machine unlearning
Comments:
Abstract:Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
117. 【2603.00729】Qwen3-Coder-Next Technical Report
Link: https://arxiv.org/abs/2603.00729
Authors: Ruisheng Cao,Mouxiang Chen,Jiawei Chen,Zeyu Cui,Yunlong Feng,Binyuan Hui,Yuheng Jing,Kaixin Li,Mingze Li,Junyang Lin,Zeyao Ma,Kashun Shum,Xuwu Wang,Jinxi Wei,Jiaxi Yang,Jiajun Zhang,Lei Zhang,Zongmeng Zhang,Wenting Zhao,Fan Zhou
Subjects: Computation and Language (cs.CL)
Keywords: language model specialized, Abstract, open-weight language model, language model, model specialized
Comments: Authors are listed alphabetically by their last names
Abstract:We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
118. 【2603.00725】LaSTR: Language-Driven Time-Series Segment Retrieval
Link: https://arxiv.org/abs/2603.00725
Authors: Kota Dohi,Harsh Purohit,Tomoya Nishida,Takashi Endo,Yusuke Ohtsubo,Koichiro Yawata,Koki Takeshita,Tatsuya Sasaki,Yohei Kawaguchi
Subjects: Computation and Language (cs.CL)
Keywords: Effectively searching time-series, require expert-designed similarity, expert-designed similarity criteria, Effectively searching, system analysis
Comments:
Abstract:Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment–caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text–time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.
119. 【2603.00724】RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
Link: https://arxiv.org/abs/2603.00724
Authors: Andrew Zhuoer Feng,Cunxiang Wang,Bosi Wen,Yidong Wang,Yu Luo,Hongning Wang,Minlie Huang
Subjects: Computation and Language (cs.CL)
Keywords: Large language model, learning depends critically, reinforcement learning depends, Large language, language model alignment
Comments: 25 pages, 7 figures
Abstract:Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: this https URL.
120. 【2603.00718】SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Link: https://arxiv.org/abs/2603.00718
Authors: Shiqi Chen,Jingze Gai,Ruochen Zhou,Jinghan Zhang,Tongyao Zhu,Junlong Li,Kangrui Wang,Zihan Wang,Zhengyu Chen,Klara Kaleb,Ning Miao,Siyang Gao,Cong Lu,Manling Li,Junxian He,Yee Whye Teh
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: Real-world tool-using agents, effective behavior requires, Real-world tool-using, tool-using agents operate, reusing higher-level tool
Comments: 21 pages. Code: [this https URL](https://github.com/shiqichen17/SkillCraft) ; Project page: [this https URL](https://skillcraft-website.github.io/page)
Abstract:Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills and to cache and reuse them inside and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
121. 【2603.00696】DRIV-EX: Counterfactual Explanations for Driving LLMs
Link: https://arxiv.org/abs/2603.00696
Authors: Amaia Cardiel,Eloi Zablocki,Elias Ramzi,Eric Gaussier
Subjects: Computation and Language (cs.CL)
Keywords: decision-making remains opaque, Large language models, Large language, remains opaque, reasoning engines
Comments:
Abstract:Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing the linguistic fluency, domain validity, and proximity to the original input, essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
122. 【2603.00686】RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis
Link: https://arxiv.org/abs/2603.00686
Authors: Andrew Zhuoer Feng,Cunxiang Wang,Yu Luo,Bosi Wen,Yidong Wang,Lin Fan,Yilin Zhou,Zikang Wang,Wenbo Yu,Lindong Wu,Hongning Wang,Minlie Huang
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, Models have evolved, long-horizon agents
Comments: 35 pages, 7 figures
Abstract:Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: this https URL.
123. 【2603.00683】Polynomial Mixing for Efficient Self-supervised Speech Encoders
Link: https://arxiv.org/abs/2603.00683
Authors: Eva Feillet,Ryan Whetten,David Picard,Alexandre Allauzen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: typically employ Transformer-based, employ Transformer-based encoders, models typically employ, model token dependencies, employ Transformer-based
Comments: Accepted at ICASSP 2026
Abstract:State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.
124. 【2603.00669】SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
Link: https://arxiv.org/abs/2603.00669
Authors: Chaoyue He,Xin Zhou,Xinjia Yu,Lei Zhang,Yan Zhang,Yi Wu,Lei Xiao,Liangyue Li,Di Wang,Hong Xu,Xiaoqiao Wang,Wei Liu,Chunyan Miao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Sustainability disclosure standards, hindering structured analysis, Sustainability Standards Knowledge, Standards Knowledge Graph, Knowledge Graph Hub
Comments: 10 pages, 2 figures, 2 tables, submitted to ACL26 System Demo Track
Abstract:Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at this http URL.
125. 【2603.00634】BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Link: https://arxiv.org/abs/2603.00634
Authors: Jason Lucas,Matt Murtagh-White,Adaku Uchendu,Ali Al-Lawati,Michiharu Yamashita,Dominik Macko,Ivan Srba,Robert Moro,Dongwon Lee
Subjects: Computation and Language (cs.CL)
Keywords: benchmarks remain confined, leaving low-resource linguistic, information integrity worldwide, low-resource linguistic communities, threaten information integrity
Comments:
Abstract:Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource "big-head" (20) and low-resource "long-tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English$\leftrightarrow$X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection. Dataset and code are available at: this https URL
126. 【2603.00623】TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces
Link: https://arxiv.org/abs/2603.00623
Authors: Shu-Xun Yang,Cunxiang Wang,Haoke Zhang,Wenbo Yu,Lindong Wu,Jiayi Gui,Dayong Yang,Yukuo Cen,Zhuoer Feng,Bosi Wen,Yidong Wang,Lucen Zhong,Jiamin Ren,Linfeng Zhang,Jie Tang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: iterative decision making, systems augment large, augment large language, large language models, enabling complex tasks
Comments:
Abstract:Agentic systems augment large language models with external tools and iterative decision making, enabling complex tasks such as deep research, function calling, and coding. However, their long and intricate execution traces make failure diagnosis and root cause analysis extremely challenging. Manual inspection does not scale, while directly applying LLMs to raw traces is hindered by input length limits and unreliable reasoning. Focusing solely on final task outcomes further discards critical behavioral information required for accurate issue localization. To address these issues, we propose TraceSIR, a multi-agent framework for structured analysis and reporting of agentic execution traces. TraceSIR coordinates three specialized agents: (1) StructureAgent, which introduces a novel abstraction format, TraceFormat, to compress execution traces while preserving essential behavioral information; (2) InsightAgent, which performs fine-grained diagnosis including issue localization, root cause analysis, and optimization suggestions; (3) ReportAgent, which aggregates insights across task instances and generates comprehensive analysis reports. To evaluate TraceSIR, we construct TraceBench, covering three real-world agentic scenarios, and introduce ReportEval, an evaluation protocol for assessing the quality and usability of analysis reports aligned with industry needs. Experiments show that TraceSIR consistently produces coherent, informative, and actionable reports, significantly outperforming existing approaches across all evaluation dimensions. Our project and video are publicly available at this https URL.
127. 【2603.00621】Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
Link: https://arxiv.org/abs/2603.00621
Authors: Anastasia Zhukova,Terry Ruas,Jan Philip Wahle,Bela Gipp
Subjects: Computation and Language (cs.CL)
Keywords: remains fragmented due, CDCR remains fragmented, varying annotation standards, heterogeneous dataset formats, English CDCR corpora
Comments: accepted to LREC 2026
Abstract:Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at this https URL, and the code for parsing, analyzing, and scoring the dataset is available at this https URL.
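The same-head-lemma baseline referred to in the abstract above is commonly understood as linking all mentions whose syntactic heads share a lemma. The following is a minimal sketch under that assumption, not the uCDCR scoring code: the function name, the `(mention_id, head_lemma)` input format, and the sample mentions are hypothetical, and head lemmas are assumed to be precomputed by some parser.

```python
from collections import defaultdict


def same_head_lemma_clusters(mentions):
    """Group mentions into coreference clusters by identical head lemma.

    mentions: iterable of (mention_id, head_lemma) pairs, where head_lemma
    is assumed to be precomputed (hypothetical input format).
    Returns a list of clusters, each a list of mention ids.
    """
    clusters = defaultdict(list)
    for mention_id, head_lemma in mentions:
        # Case-insensitive match on the head lemma is the only linking signal.
        clusters[head_lemma.lower()].append(mention_id)
    return list(clusters.values())


# Made-up mentions from two documents.
mentions = [("d1_m1", "attack"), ("d2_m4", "Attack"), ("d2_m7", "bombing")]
predicted = same_head_lemma_clusters(mentions)
```

Because the baseline relies only on surface lemmas, its score reflects how lexically diverse a dataset's coreference chains are, which is the property the abstract uses it to measure.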
128. 【2603.00620】QQ: A Toolkit for Language Identifiers and Metadata
Link: https://arxiv.org/abs/2603.00620
Authors: Wessel Poelman,Yiyi Chen,Miryam de Lhoneux
Subjects: Computation and Language (cs.CL)
Keywords: poses challenges, growing number, challenges regarding properly, properly and accurately, accurately reporting
Comments: System Demo
Abstract:The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different language identifiers; some use BCP-47 (e.g. en_Latn), others use ISO 639-1 (en), and more linguistically oriented datasets use Glottocodes (stan1293). Mapping between identifiers is manageable for a few dozen languages, but becomes unscalable when dealing with thousands. We introduce QwanQwa, a light-weight Python toolkit for unified language metadata management. QQ integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and affords a graph-based structure that enables traversal across families, regions, writing systems, and other linguistic attributes. QQ serves both as (1) a simple "glue" library in multilingual NLP research to make working with many languages easier, and (2) as an intuitive way for exploring languages, such as finding related ones through shared scripts, regions or other metadata.
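The identifier-mapping problem described in the abstract above can be sketched with a small lookup table. This does not use or reproduce the QwanQwa API, which is not shown in the source: `RECORDS`, `build_index`, and `convert` are hypothetical names, and the single record reuses only the English identifiers quoted in the abstract.

```python
# One record per language; field names are illustrative assumptions.
RECORDS = [
    {
        "name": "English",
        "bcp_47": "en_Latn",       # as written in the abstract
        "iso_639_1": "en",
        "glottocode": "stan1293",
    },
]


def build_index(records):
    """Index every (scheme, code) pair so any identifier finds its record."""
    index = {}
    for record in records:
        for scheme, code in record.items():
            if scheme != "name":
                index[(scheme, code)] = record
    return index


def convert(index, code, source, target):
    """Map a code from one scheme to another; None if unknown."""
    record = index.get((source, code))
    return None if record is None else record.get(target)


INDEX = build_index(RECORDS)
glottocode = convert(INDEX, "en", "iso_639_1", "glottocode")
```

A flat table like this is workable for a few languages; the abstract's point is that it stops scaling at thousands of languages, which motivates a dedicated toolkit with a graph structure.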
129. 【2603.00612】From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation
Link: https://arxiv.org/abs/2603.00612
Authors: Raneen Younis,Suvinava Basak,Lukas Chavez,Zahra Ahmadi
Subjects: Computation and Language (cs.CL)
Keywords: systematically connect biomarker, connect biomarker mechanisms, actionable drug combination, drug combination hypotheses, rapid growth
Comments:
Abstract:The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.
130. 【2603.00592】LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
Link: https://arxiv.org/abs/2603.00592
Authors: Yuchen Hou,Lin Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: VLA models, VLA, language, VLA models largely, cs.RO
Comments: 7 pages, 3 figures. Code and benchmark will be available at [this https URL](https://github.com/YC11Hou/langgap)
Abstract:Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap -- success rate improves from 0% to 90% with single-task training, and 0% to 28% with multi-task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions -- precisely the long-term value of LangGap.
131. 【2603.00582】Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research
Link: https://arxiv.org/abs/2603.00582
Authors: Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Hehe Fan, Liang Zhang, Siyuan Zhao, Linyi
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, sources-remains largely unexplored, heterogeneous sources-remains largely, Wide Search
Comments:
Abstract:While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions -- those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources -- remains largely unexplored. We introduce Super Research, a task for complex autonomous research tasks that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: this https URL
132. 【2603.00578】Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Link: https://arxiv.org/abs/2603.00578
Authors: Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Siliang Tang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: reasoning, substantial increase, existing CoT paradigms, CoT paradigms tend, large reasoning models
Comments:
Abstract:Long chain-of-thought (CoT) has become a dominant paradigm for enhancing the reasoning capability of large reasoning models (LRMs); however, the performance gains often come with a substantial increase in reasoning budget. Recent studies show that existing CoT paradigms tend to induce systematic overthinking, unnecessarily coupling reasoning capability with reasoning cost. Most prior approaches reduce token usage through post hoc techniques such as token compression, truncation, or length penalties, without explicitly addressing the core mechanisms of reasoning. We propose Draft-Thinking, which guides models to first learn a concise draft-style reasoning structure that retains only the critical reasoning steps. Through progressive curriculum learning, the model stably internalizes this efficient reasoning pattern as its capability scales. Moreover, Draft-Thinking introduces adaptive prompting, which elevates reasoning depth to a flexible, model-selectable behavior. Extensive experiments demonstrate that Draft-Thinking substantially reduces reasoning budget while largely preserving reasoning performance; for example, on MATH500, it achieves an 82.6% reduction in reasoning budget at the cost of only a 2.6% performance drop.
133. 【2603.00573】CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
Link: https://arxiv.org/abs/2603.00573
Authors: Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang
Categories: Computation and Language (cs.CL)
Keywords: Large language models, achieve remarkable performance, Large language, Core Space, Core
Comments:
Abstract:Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (CoMoL), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.
134. 【2603.00523】CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
Link: https://arxiv.org/abs/2603.00523
Authors: Swapnil Parekh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: arbitrary analyst choices, analyst choices, feature dictionaries, yielding brittle, notion of uncertainty
Comments:
Abstract:Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
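The stability score and strict-consensus step described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `edge_stability` and `strict_consensus` and the three edge sets are hypothetical, chosen only to show the bookkeeping.

```python
from collections import Counter

def edge_stability(pruned_graphs):
    """Return {edge: fraction of pruning configurations that retain it}."""
    counts = Counter(edge for graph in pruned_graphs for edge in set(graph))
    total = len(pruned_graphs)
    return {edge: n / total for edge, n in counts.items()}

def strict_consensus(pruned_graphs):
    """Keep only the edges that appear in every pruned graph."""
    stability = edge_stability(pruned_graphs)
    return {edge for edge, score in stability.items() if score == 1.0}

# Hypothetical edge sets from three pruning configurations (assumed data).
graphs = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("a", "d")},
]
scores = edge_stability(graphs)   # per-edge stability in [0, 1]
core = strict_consensus(graphs)   # edges retained by all configurations
```

Edges with a score strictly between 0 and 1 are the "contingent" ones in the abstract's terminology; only the score-1 edges enter the consensus circuit.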
135. 【2603.00465】Optimizing In-Context Demonstrations for LLM-based Automated Grading
Link: https://arxiv.org/abs/2603.00465
Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Kevin Haudek, Joseph Krajcik, Jiliang Tang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: scaling personalized feedback, open-ended student responses, feedback in education, open-ended student, critical capability
Comments:
Abstract:Automated assessment of open-ended student responses is a critical capability for scaling personalized feedback in education. While large language models (LLMs) have shown promise in grading tasks via in-context learning (ICL), their reliability is heavily dependent on the selection of few-shot exemplars and the construction of high-quality rationales. Standard retrieval methods typically select examples based on semantic similarity, which often fails to capture subtle decision boundaries required for rubric adherence. Furthermore, manually crafting the expert rationales needed to guide these models can be a significant bottleneck. To address these limitations, we introduce GUIDE (Grading Using Iteratively Designed Exemplars), a framework that reframes exemplar selection and refinement in automated grading as a boundary-focused optimization problem. GUIDE operates on a continuous loop of selection and refinement, employing novel contrastive operators to identify "boundary pairs" that are semantically similar but possess different grades. We enhance exemplars by generating discriminative rationales that explicitly articulate why a response receives a specific score to the exclusion of adjacent grades. Extensive experiments across datasets in physics, chemistry, and pedagogical content knowledge demonstrate that GUIDE significantly outperforms standard retrieval baselines. By focusing the model's attention on the precise edges of the rubric, our approach shows exceptionally robust gains on borderline cases and improved rubric adherence. GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
136. 【2603.00451】Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Link: https://arxiv.org/abs/2603.00451
Authors: Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Jiliang Tang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: misinterpret expert guidelines, Accurate and unambiguous, large language model, based graders, domain specificity
Comments:
Abstract:Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in "rule dilution" where conflicting constraints weaken the model's grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted "fixing patches" for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
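The idea of decomposing grading errors by confusion-matrix cell, as this abstract describes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `dominant_error_modes` is a hypothetical helper and the grade lists are made-up data, not CARO's actual pipeline.

```python
from collections import Counter

def dominant_error_modes(gold, predicted, top_k=2):
    """Group grading errors by (gold, predicted) cell and return the largest cells."""
    modes = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return modes.most_common(top_k)

# Made-up rubric scores for eight responses (assumed data).
gold = [2, 2, 1, 0, 2, 1, 1, 2]
pred = [1, 1, 1, 0, 1, 2, 1, 2]
modes = dominant_error_modes(gold, pred)
```

Each returned cell, such as "gold 2 graded as 1", is one error mode that could be repaired on its own instead of being mixed with unrelated mistakes in a single update.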
137. 【2603.00434】RTLocating: Intent-aware RTL Localization for Hardware Design Iteration
Link: https://arxiv.org/abs/2603.00434
Authors: Changwen Xing, Yanfeng Lu, Lei Qi, Chenxu Niu, Jie Li, Xi Wang, Yong Chen, Jun Yang
Categories: Emerging Technologies (cs.ET); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Industrial chip development, Industrial chip, favoring localized, Register Transfer Level, inherently iterative
Comments:
Abstract:Industrial chip development is inherently iterative, favoring localized, intent-driven updates over rewriting RTL from scratch. Yet most LLM-Aided Hardware Design (LAD) work focuses on one-shot synthesis, leaving this workflow underexplored. To bridge this gap, we for the first time formalize ΔSpec-to-RTL localization, a multi-positive problem mapping natural language change requests (ΔSpec) to the affected Register Transfer Level (RTL) syntactic blocks. We propose RTLocating, an intent-aware RTL localization framework, featuring a dynamic router that adaptively fuses complementary views from a textual semantic encoder, a local structural encoder, and a global interaction and dependency encoder (GLIDE). To enable scalable supervision, we introduce EvoRTL-Bench, the first industrial-scale benchmark for intent-code alignment derived from OpenTitan's Git history, comprising 1,905 validated requests and 13,583 ΔSpec-RTL block pairs. On EvoRTL-Bench, RTLocating achieves 0.568 MRR and 15.08% R@1, outperforming the strongest baseline by +22.9% and +67.0%, respectively, establishing a new state-of-the-art for intent-driven localization in evolving hardware designs.
138. 【2603.00432】A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs
Link: https://arxiv.org/abs/2603.00432
Authors: Anna Feldman, Libby Barak, Jing Peng
Categories: Computation and Language (cs.CL)
Keywords: versus inflectional form, order versus inflectional, word order versus, inflectional form, multilingual masked language
Comments:
Abstract:We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head--dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head--dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
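One of the perturbations named in this abstract, content-word scrambling with function words fixed, can be sketched as follows. This is a hedged illustration, not the released code: the set `FUNCTION_UPOS` (including the choice to freeze punctuation), the function name, and the example sentence are all assumptions.

```python
import random

# Assumed set of Universal Dependencies UPOS tags treated as fixed (function words
# plus punctuation); the paper's exact definition may differ.
FUNCTION_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}

def scramble_content_words(tokens, upos, seed=0):
    """Shuffle content words among their own positions; function words stay put."""
    rng = random.Random(seed)
    slots = [i for i, tag in enumerate(upos) if tag not in FUNCTION_UPOS]
    shuffled = [tokens[i] for i in slots]
    rng.shuffle(shuffled)
    out = list(tokens)
    for slot, word in zip(slots, shuffled):
        out[slot] = word
    return out

# Hypothetical tagged sentence.
tokens = ["The", "dog", "chased", "a", "red", "ball", "."]
upos = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
perturbed = scramble_content_words(tokens, upos)
```

Full token scrambling is the special case where every position is treated as a content slot.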
139. 【2603.00426】LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
Link: https://arxiv.org/abs/2603.00426
Authors: Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie, Yifan Huang, Lei Wu, Wei Han, Haishuai Wang, Jiajun Bu
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: utilizing Multimodal Large, Multimodal Large Language, frequently encounters challenges, reports utilizing Multimodal, encounters challenges related
Comments: 10 pages, 1 figure
Abstract:The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
140. 【2603.00369】Policy Compliance of User Requests in Natural Language for AI Systems
Link: https://arxiv.org/abs/2603.00369
Authors: Pedro Cisneros-Velarde
Categories: Computation and Language (cs.CL)
Keywords: users send requests, specific tasks, natural language, carrying out specific, user requests comply
Comments:
Abstract:Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring such user requests comply with a list of diverse policies determined by the organization with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLMs on policy compliance assessment under different solution methods. We analyze the differences in performance metrics across the models and solution methods, showcasing the challenging nature of our problem.
141. 【2603.00364】Distribution-Aware Companding Quantization of Large Language Models
Link: https://arxiv.org/abs/2603.00364
Authors: Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva
Categories: Computation and Language (cs.CL)
Keywords: GPT and Llama, language models, models, GPT, Llama
Comments:
Abstract:Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
142. 【2603.00359】How Large Language Models Get Stuck: Early structure with persistent errors
Link: https://arxiv.org/abs/2603.00359
Authors: Alokesh Manna, William Snyder, Whitney Tabor
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: make Large Language, Large Language Model, Large Language, make Large, Meta OPT model
Comments:
Abstract:Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method (in progress) of testing the hypothesis on appropriately selected BLiMP classes.
143. 【2603.00314】When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Link: https://arxiv.org/abs/2603.00314
Authors: Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: baseline model selection, Long Island University, healthcare settings, paper details, details the baseline
Comments:
Abstract:This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model's accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. Ideally, the model's outputs would be reviewed and evaluated by medical experts. Due to resource constraints, the performance of the fine-tuned model was evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4's evaluation. GPT-4's evaluations differ markedly from the quantitative results; we therefore propose that the results be evaluated by human medical experts.
144. 【2603.00307】From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction
Link: https://arxiv.org/abs/2603.00307
Authors: Matic Korun
Categories: Computation and Language (cs.CL)
Keywords: geometric hallucination taxonomy, wrong-well convergence, coverage gaps, controlled induction, distinguish hallucination types
Comments: 9 pages, 2 figures, appendices (reproducibility, sample generation, additional figures)
Abstract:We test whether a geometric hallucination taxonomy -- classifying failures as center-drift (Type 1), wrong-well convergence (Type 2), or coverage gaps (Type 3) -- can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts (N = 15 per group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type 3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median r = +0.61). In contextual hidden states, the Type 3 norm effect direction is stable (19/20 runs) but underpowered at N = 15 (significant in 4/20, median r = -0.28). Types 1 and 2 do not separate in either space (≤ 3/20 runs). Token-level tests inflate significance by 4-16× through pseudoreplication -- a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.
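The abstract reports Holm-corrected significance. For readers unfamiliar with it, the standard Holm-Bonferroni step-down procedure can be sketched as below. The function name `holm_reject` and the p-values are illustrative assumptions, not numbers from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: return one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0..m-1) is compared with alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained too
    return reject

# Illustrative p-values only (not taken from the paper).
flags = holm_reject([0.010, 0.040, 0.030, 0.005])
```

With four tests the thresholds are 0.0125, 0.0167, 0.025 and 0.05, so the two smallest p-values are rejected and the procedure stops at 0.030.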
145. 【2603.00296】Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
Link: https://arxiv.org/abs/2603.00296
Authors: Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: producing unnecessarily long, Large reasoning models, Large reasoning, test-time computation, producing unnecessarily
Comments: Preprint
Abstract:Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
146. 【2603.00270】Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
Link: https://arxiv.org/abs/2603.00270
Authors: Sourav Chattaraj, Kanak Raj
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: encounter conflicting information, large language models, language models encounter, models encounter conflicting, memories survive
Comments: 16 pages, 10 figures. Under review
Abstract:When large language models encounter conflicting information in context, which memories survive -- early or recent? We adapt classical interference paradigms from cognitive psychology to answer this question, testing 39 LLMs across diverse architectures and scales. Every model shows the same pattern: proactive interference (PI) dominates retroactive interference (RI) universally (Cohen's d = 1.73, p < 0.0001), meaning early encodings are protected at the cost of recent information -- the opposite of human memory, where RI typically dominates. Three findings indicate that RI and PI reflect separate memory mechanisms. RI and PI are uncorrelated (R^2 = 0.044), rejecting a unified "memory capacity." Model size predicts RI resistance (R^2 = 0.49) but not PI (R^2 = 0.06, n.s.) -- only RI is capacity-dependent. And error analysis reveals distinct failure modes: RI failures are passive retrieval failures (51%), while PI failures show active primacy intrusion (56%); both show 1% hallucination. These patterns parallel the consolidation-retrieval distinction in cognitive science, suggesting that transformer attention creates a primacy bias with direct implications for interference-heavy applications.
147. 【2603.00196】Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models
Link: https://arxiv.org/abs/2603.00196
Authors: Chung-ju Huang, Huiqiang Zhao, Yuanpeng He, Lijian Li, Wenpin Jiao, Zhi Jin, Peixuan Chen, Leye Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: cloud-hosted Large Language, Large Language Models, Large Language, potential privacy breaches, cloud-hosted Large
Comments: 19 pages, 5 figures
Abstract:The increasing reliance on cloud-hosted Large Language Models (LLMs) exposes sensitive client data, such as prompts and responses, to potential privacy breaches by service providers. Existing approaches fail to ensure privacy, maintain model performance, and preserve computational efficiency simultaneously. To address this challenge, we propose Talaria, a confidential inference framework that partitions the LLM pipeline to protect client data without compromising the cloud's model intellectual property or inference quality. Talaria executes sensitive, weight-independent operations within a client-controlled Confidential Virtual Machine (CVM) while offloading weight-dependent computations to the cloud GPUs. The interaction between these environments is secured by our Reversible Masked Outsourcing (ReMO) protocol, which uses a hybrid masking technique to reversibly obscure intermediate data before outsourcing computations. Extensive evaluations show that Talaria can defend against state-of-the-art token inference attacks, reducing token reconstruction accuracy from over 97.5% to an average of 1.34%, all while being a lossless mechanism that guarantees output identical to the original model without significantly decreasing efficiency and scalability. To the best of our knowledge, this is the first work that ensures clients' prompts and responses remain inaccessible to the cloud, while also preserving model privacy, performance, and efficiency.
148. 【2603.00186】RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Link: https://arxiv.org/abs/2603.00186
Authors: Srikumar Nayak
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: systems run nonstop, Financial systems run, systems run, run nonstop, stay reliable
Comments: 6 pages, 2 figures and 2 tables
Abstract:Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must make a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes behavior. Reinforcement learning (RL) is a good fit for sequential decisions, but much of the RL-in-finance literature targets trading and does not model real cyber response limits such as action cost, service disruption, and defender coordination across many assets. This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense. We model the enterprise attack surface as a Markov decision process (MDP) where states summarize alerts, asset exposure, and service health, and actions represent real response steps (e.g., isolate a host, rotate credentials, ratelimit an API, block an account, or trigger recovery). RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk-sensitive objective that balances containment speed, business disruption, and response cost. We also include a game-aware evaluation that tests policies against adaptive attackers and reports operational outcomes, not only reward. Experiments show that RLShield reduces time-to-containment and residual exposure while keeping disruption within a fixed response budget, outperforming static rule baselines and single-agent RL under the same constraints. These results suggest that multi-agent, cost-aware RL can provide a deployable layer for automated response in financial security operations.
149. 【2603.00105】LIDS: LLM Summary Inference Under the Layered Lens
Link: https://arxiv.org/abs/2603.00105
Authors: Dylan Park, Yingying Fan, Jinchi Lv
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME); Machine Learning (stat.ML)
Keywords: gained significant attention, Large language models, natural language processing, gained significant, significant attention
Comments: 48 pages, 15 figures
Abstract:Large language models (LLMs) have gained significant attention from many researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries based on prompts. Yet evaluating the quality of these summaries remains challenging due to the complexity of language. To this end, in this paper we suggest a new method of LLM summary inference with BERT-SVD-based direction metric and SOFARI (LIDS) that assesses the summary accuracy equipped with interpretable key words for layered themes. LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and original text, leveraging the BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.
150. 【2603.00086】Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
Link: https://arxiv.org/abs/2603.00086
Authors: Ambre Marie (LaTIM), Thomas Bertin (DySoLab), Guillaume Dardenne (LaTIM), Gwenolé Quellec (LaTIM)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: Automatic speech recognition, word error rates, Automatic speech, conversations remains challenging, medical conversations remains
Comments:
Abstract:Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment.
151. 【2603.00084】DeepXiv-SDK: An Agentic Data Interface for Scientific Papers
Link: https://arxiv.org/abs/2603.00084
Authors: Hongjin Qian, Ziyi Xia, Ze Liu, Jianlv Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Sen Wang, Zhengyang Liang, Zheng Liu
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: data, LLM-agents are increasingly, accelerate the progress, data access, agentic data interface
Comments: Project at [this https URL](https://github.com/DeepXiv/deepxiv_sdk)
Abstract:LLM-agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstructured, human-centric data on the Internet, such as HTML web-pages and PDF files, leading to excessive token consumption, limited working efficiency, and brittle evidence look-up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost-aware manner. In this paper, we introduce DeepXiv-SDK, which offers a three-layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human-centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad-hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built-in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv-SDK currently supports the complete ArXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open-access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv-SDK is free to use with registration.
152. 【2603.00082】Linguistic Uncertainty and Engagement in Arabic-Language X (formerly Twitter) Discourse
Link: https://arxiv.org/abs/2603.00082
Authors: Mohamed Soufan
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: engagement remains underexplored, user engagement remains, remains underexplored, relationship with user, Linguistic uncertainty
Comments: 15 pages, 1 figure, 1 table
Abstract:Linguistic uncertainty is a common feature of social media discourse, yet its relationship with user engagement remains underexplored, particularly in non-English contexts. Using a dataset of 16,695 Arabic-language tweets about Lebanon posted over a 35-day period, we examine whether tweets expressing linguistic uncertainty receive different levels and forms of engagement compared to certainty-marked tweets. We develop a lexicon-based, context-sensitive classifier to identify uncertainty markers and classify 29.9% of tweets as uncertain. Descriptive analyses indicate that uncertain tweets exhibit 51.5% higher mean total engagement (likes, retweets, and replies). Regression models controlling for tweet length, URL presence, and account verification status confirm a positive association between uncertainty and engagement ($\beta$ = 0.221, SE = 0.044, p < 0.001), corresponding to approximately 25% higher expected engagement. The association is strongest for replies, followed by retweets and likes, suggesting a shift toward more conversational forms of engagement. Results are robust to alternative model specifications and adjustments for within-account correlation. These findings suggest that linguistic uncertainty may function as an interactional cue that encourages participatory engagement in Arabic-language social media discourse. The study contributes computational approaches for modeling linguistic features in large-scale, non-English digital communication.
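The step from the reported coefficient (0.221) to "approximately 25% higher expected engagement" can be checked with a short calculation. The sketch below assumes the regression uses a log link (as in Poisson or negative binomial count models), which the abstract does not state; under that assumption, exponentiating the coefficient gives the multiplicative change in expected engagement. The function name is illustrative only.

```python
import math


def percent_change_from_log_coef(beta: float) -> float:
    """Percent change in the expected outcome for a one-unit increase in a
    predictor, ASSUMING a log-link model (not stated in the abstract)."""
    return (math.exp(beta) - 1.0) * 100.0


beta = 0.221  # coefficient reported in the abstract
se = 0.044    # standard error reported in the abstract

point = percent_change_from_log_coef(beta)
# Approximate 95% Wald interval on the coefficient scale, then transformed.
low = percent_change_from_log_coef(beta - 1.96 * se)
high = percent_change_from_log_coef(beta + 1.96 * se)

print(f"{point:.1f}% (approx. 95% interval {low:.1f}% to {high:.1f}%)")
```

Under the log-link assumption the point estimate lands just under 25%, which matches the figure quoted in the abstract.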
153. 【2603.00077】Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Link: https://arxiv.org/abs/2603.00077
Authors: Delip Rao, Chris Callison-Burch
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: assessing text generation, Rubric-based evaluation, large language models, partial solutions, standard practice
Comments: 43 pages
Abstract:Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $\kappa$, weighted $\kappa$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
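Two of the ingredients named in the abstract, majority-vote aggregation across judges and Cohen's $\kappa$ as a reliability metric, are standard and easy to sketch. The code below is a generic illustration in plain Python, not Autorubric's API; the function names and the toy verdicts are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(verdicts: Sequence[Hashable]) -> Hashable:
    """Return the most frequent verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical data: three judges score eight responses on one binary criterion.
judge_votes = [
    [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1],
    [1, 1, 1], [0, 1, 0], [1, 1, 0], [1, 0, 1],
]
human_labels = [1, 1, 0, 0, 1, 0, 1, 1]

ensemble = [majority_vote(votes) for votes in judge_votes]
kappa = cohens_kappa(ensemble, human_labels)
print(ensemble, round(kappa, 3))
```

Here the ensemble disagrees with the human label on one item out of eight; because chance agreement is 0.5 for these marginals, the raw agreement of 0.875 corresponds to a kappa of 0.75.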
154. 【2603.00072】Designing Explainable AI for Healthcare Reviews: Guidance on Adoption and Trust
Link: https://arxiv.org/abs/2603.00072
Authors: Eman Alamoudi, Ellis Solaiman
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: Patients increasingly rely, hinder effective decision-making, choosing healthcare providers, effective decision-making, increasingly rely
Comments:
Abstract:Patients increasingly rely on online reviews when choosing healthcare providers, yet the sheer volume of these reviews can hinder effective decision-making. This paper summarises a mixed-methods study aimed at evaluating a proposed explainable AI system that analyses patient reviews and provides transparent explanations for its outputs. The survey (N=60) indicated broad optimism regarding usefulness (82% agreed it saves time; 78% that it highlights essentials), alongside strong demand for explainability (84% considered it important to understand why a review is classified; 82% said explanations would increase trust). Around 45% preferred combined text-and-visual explanations. Thematic analysis of open-ended survey responses revealed core requirements such as accuracy, clarity and simplicity, responsiveness, data credibility, and unbiased processing. In addition, interviews with AI experts provided deeper qualitative insights, highlighting technical considerations and potential challenges for different explanation methods. Drawing on TAM and trust in automation, the findings suggest that high perceived usefulness and transparent explanations promote adoption, whereas complexity and inaccuracy hinder it. This paper contributes actionable design guidance for layered, audience-aware explanations in healthcare review systems.
155. 【2603.00056】How effective are VLMs in assisting humans in inferring the quality of mental models from Multimodal short answers?
Link: https://arxiv.org/abs/2603.00056
Authors: Pritam Sil, Durgaprasad Karnam, Vinay Reddy Venumuddala, Pushpak Bhattacharyya
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: STEM Mental models, assessing students' conceptual, STEM Mental, role in assessing, students' conceptual understanding
Comments:
Abstract:STEM Mental models can play a critical role in assessing students' conceptual understanding of a topic. They not only offer insights into what students know but also into how effectively they can apply, relate to, and integrate concepts across various contexts. Thus, students' responses are critical markers of the quality of their understanding and not entities that should be merely graded. However, inferring these mental models from student answers is challenging as it requires deep reasoning skills. We propose MMGrader, an approach that infers the quality of students' mental models from their multimodal responses using concept graphs as an analytical framework. In our evaluation with 9 openly available models, we found that the best-performing models fall short of human-level performance. This is because they only achieved an accuracy of approximately 40%, a prediction error of 1.1 units, and a scoring distribution fairly aligned with human scoring patterns. With improved accuracy, these can be highly effective assistants to teachers in inferring the mental models of their entire classrooms, enabling them to do so efficiently and help improve their pedagogies more effectively by designing targeted help sessions and lectures that strengthen areas where students collectively demonstrate lower proficiency.
156. 【2603.00031】GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency
Link: https://arxiv.org/abs/2603.00031
Authors: Changhao Wang, Jiaolong Yang, Xinhao Yao, Yunfei Yu, Peng Jiao, Lu Yu, Junpeng Fang, Riccardo Cantoro, Qing Cui, Jun Zhou
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, raw scaling volume, Large Language, scaling volume, Language Models
Comments:
Abstract:The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce GRIP (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a Rapid Adaptation Probe (RAP) to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a length-rectified geometric prior to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, surpassing the performance of models trained on $3\times$ larger uncurated datasets. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.
157. 【2603.00030】SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
Link: https://arxiv.org/abs/2603.00030
Authors: Xiaoxin Shi, Jiaxin Wan, Linkang Dong, Wei Jiang, Yue Liu, Zengfeng Huang
Categories: Computation and Language (cs.CL)
Keywords: autoregressive decoding imposes, enables intelligent agents, limits real-time applications, LLM-based function calling, fundamental latency bottleneck
Comments:
Abstract:LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
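The link between the reported 61.2 ms P50 latency and "16 Hz real-time control" is a simple reciprocal. The sketch below assumes one function call per control step with calls issued back to back, which is a simplification; the helper names are illustrative and not taken from the paper.

```python
def max_control_rate_hz(latency_ms: float) -> float:
    """Highest control frequency sustainable if each step waits for one call."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_budget_ms(rate_hz: float) -> float:
    """Per-step time budget implied by a target control frequency."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_hz


p50_ms = 61.2  # P50 latency reported in the abstract
rate = max_control_rate_hz(p50_ms)

print(f"{rate:.1f} Hz sustainable at P50")
print(f"{latency_budget_ms(10):.0f} ms budget for the 10 Hz example")
```

A median latency of 61.2 ms fits inside the 62.5 ms budget of a 16 Hz loop, and leaves clear headroom against the 100 ms budget of the 10 Hz example mentioned in the abstract; tail latencies would still need to be checked separately.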
158. 【2603.00029】Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
Link: https://arxiv.org/abs/2603.00029
Authors: Youngji Roh, Hyunjin Cho, Jaehyung Kim
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, exhibit highly anisotropic, anisotropic internal representations
Comments:
Abstract:Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
159. 【2603.00028】EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal
Link: https://arxiv.org/abs/2603.00028
Authors: Samah Fodeh, Yan Wang, Linhai Ma, Srivani Talakokkul, Jordan M. Alpert, Sarah Schellhorn
Categories: Computation and Language (cs.CL)
Keywords: Effective communication, Subcode Classification, outcomes and adherence, Classification, health care
Comments:
Abstract:Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (<30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering
160. 【2603.00026】ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents
Link: https://arxiv.org/abs/2603.00026
Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Effective memory management, large language model, handling long-term interactions, Effective memory, language model
Comments:
Abstract:Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
161. 【2603.00025】TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation
Link: https://arxiv.org/abs/2603.00025
Authors: Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree
Categories: Computation and Language (cs.CL)
Keywords: Direct Preference Optimization, offline post-SFT method, aligning language models, Direct Preference, preference pairs
Comments:
Abstract:Direct Preference Optimization (DPO) is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens, balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence spans from patient-provider messages. TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
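For reference, the standard DPO objective that the abstract takes as its starting point can be written as below. This is the usual sequence-level formulation, not the TAB-PO loss, whose exact form is not given in the abstract; $y_w$ and $y_l$ denote the chosen and rejected completions, $\pi_{\mathrm{ref}}$ the frozen reference (SFT) policy, and $\beta$ the temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right],
\qquad
\log \pi_\theta(y \mid x) = \sum_{t} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right).
```

Because each sequence log-probability is a plain sum over tokens, every token carries equal weight, and tokens in a prefix shared by $y_w$ and $y_l$ contribute identical terms that cancel in the margin. When the two completions differ in only a few tokens, the margin therefore rests on very few terms, which is consistent with the low-separation and gradient-dilution issues the abstract describes.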
162. 【2603.00024】Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs
Link: https://arxiv.org/abs/2603.00024
Authors: Sean W. Kelley, Christoph Riedl
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, sycophantic behavior, uncritically conforming
Comments:
Abstract:Large Language Models (LLMs) are prone to sycophantic behavior, uncritically conforming to user beliefs. As models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they gain information to tailor agreement more effectively. Understanding how personalization modulates sycophancy is critical, yet systematic evaluation across models and contexts remains limited. We present a rigorous evaluation of personalization's impact on LLM sycophancy across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts. We find that personalization generally increases affective alignment (emotional validation, hedging/deference), but affects epistemic alignment (belief adoption, position stability, resistance to influence) with context-dependent role modulation. When the LLM's role is to give advice, personalization strengthens epistemic independence (models challenge user presuppositions). When its role is that of a social peer, personalization decreases epistemic independence. In this role, extensively personalized user challenges cause LLMs to abandon their position at significantly higher rates. Robustness tests confirm that the effects are driven by personalized conditioning, not by additional input tokens per se or demographic information alone. Our work provides measurement frameworks for evaluating personalized AI systems, demonstrates the necessity of role-sensitive evaluation, and establishes a novel benchmark to assess goal alignment.
163. 【2603.00022】Noise reduction in BERT NER models for clinical entity extraction
Link: https://arxiv.org/abs/2603.00022
Authors: Kuldeep Jiwani, Yash K Jeengar, Ayush Dhaka
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Named Entity Recognition, clinical entity extraction, NER, notes and reports, utmost importance
Comments:
Abstract:Precision is of utmost importance in the realm of clinical entity extraction from clinical notes and reports. Encoder Models fine-tuned for Named Entity Recognition (NER) are an efficient choice for this purpose, as they don't hallucinate. We pre-trained an in-house BERT over clinical data and then fine-tuned it for NER. These models performed well on recall but could not reach the high precision range needed for clinical models. To address this challenge, we developed a Noise Removal model that refines the output of NER. The NER model assigns token-level entity tags along with probability scores for each token. Our Noise Removal (NR) model then analyzes these probability sequences and classifies predictions as either weak or strong. A naïve approach might involve filtering predictions based on low probability values; however, this method is unreliable. Owing to the characteristics of the SoftMax function, Transformer-based architectures often assign disproportionately high confidence scores even to uncertain or weak predictions, making simple thresholding ineffective. To address this issue, we adopted a supervised modeling strategy in which the NR model leverages advanced features such as the Probability Density Map (PDM). The PDM captures the Semantic-Pull effect observed within Transformer embeddings, an effect that manifests in the probability distributions of NER class predictions across token sequences. This approach enables the model to classify predictions as weak or strong with significantly improved accuracy. With these NR models we were able to reduce False Positives across various clinical NER models by 50% to 90%.
164. 【2603.00021】From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization
Link: https://arxiv.org/abs/2603.00021
Authors: Ruangrin Ldallitsakool, Margarita Bugueño, Gerard de Melo
Categories: Computation and Language (cs.CL)
Keywords: automatically construct graph-based, graph-based document representations, construct graph-based document, paper proposes, proposes a data-driven
Comments:
Abstract:This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.
165. 【2603.00003】Commitment Checklist: Auditing Author Commitments in Peer Review
Link: https://arxiv.org/abs/2603.00003
Authors: Chung-Chi Chen, Iryna Gurevych
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Keywords: release code, review author responses, responses often include, clarify content, Author Commitment Checklist
Comments:
Abstract:Peer review author responses often include commitments to add experiments, release code, or clarify content in the final paper. Yet, there is currently no systematic mechanism to ensure authors fulfill these promises. In this position paper, we present a large-scale audit of author commitments using large language models (LLMs) to compare rebuttals against camera-ready versions. Analyzing the commitments from ICLR-2025 and EMNLP-2024, we find that while a majority of promised changes are implemented, a significant share (about 25%) are not, with "missing experiments" and other high-impact items among the most frequently unfulfilled. We demonstrate that LLM-based tools can feasibly detect the promises. Finally, we propose the idea of Author Commitment Checklist, which would alert authors and organizers to unaddressed promises, increasing accountability and strengthening the integrity of the peer review process. We discuss the benefits of this practice and advocate for its adoption in future conferences.
166. 【2502.16612】MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Link: https://arxiv.org/abs/2502.16612
Authors: Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: social media presents, media presents significant, presents significant challenges, hate speech, moderating complex
Comments: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images
Abstract:The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (this https URL).
167. 【2603.01270】VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
Link: https://arxiv.org/abs/2603.01270
Authors: Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir, Yair Lifshitz, Eran Segal
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Keywords: rigorous longitudinal evaluation, fundamental challenge, face a fundamental, human voice, datasets support rigorous
Comments: 4 pages, 5 figures, 2 tables
Abstract:Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
Information Retrieval
1. 【2603.02153】Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment
Link: https://arxiv.org/abs/2603.02153
Authors: Luigi Medrano, Arush Verma, Mukul Chhabra
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Retrieval-Augmented Generation, commonly adopt retrieval, higher recall leads, reciprocal rank fusion, systems commonly adopt
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better answer quality. While these methods show consistent gains in isolated retrieval benchmarks, their effectiveness under realistic production constraints remains underexplored. In this work, we evaluate retrieval fusion in a production-style RAG pipeline operating over an enterprise knowledge base, with fixed retrieval depth, re-ranking budgets, and latency constraints. Across multiple fusion configurations, we find that retrieval fusion does increase raw recall, but these gains are largely neutralized after re-ranking and truncation. In our setting, fusion variants fail to outperform single-query baselines on KB-level Top-$k$ accuracy, with Hit@10 decreasing from $0.51$ to $0.48$ in several configurations. Moreover, fusion introduces additional latency overhead due to query rewriting and larger candidate sets, without corresponding improvements in downstream effectiveness. Our analysis suggests that recall-oriented fusion techniques exhibit diminishing returns once realistic re-ranking limits and context budgets are applied. We conclude that retrieval-level improvements do not reliably translate into end-to-end gains in production RAG systems, and argue for evaluation frameworks that jointly consider retrieval quality, system efficiency, and downstream impact.
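For readers unfamiliar with reciprocal rank fusion, the following is a minimal sketch of the standard RRF scoring rule, in which each document receives the sum of 1 / (k + rank) over the ranked lists it appears in. It is not the authors' pipeline: the document IDs, the three rewritten-query result lists, the `hit_at_k` helper, and the constant k = 60 (a commonly used default) are all illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every list
    that contains it, with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_at_k(ranked, relevant, k):
    """Return 1.0 if any relevant document is in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


# Hypothetical result lists, one per rewritten query.
rewrites = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d1"],
    ["d5", "d2", "d6"],
]
fused = reciprocal_rank_fusion(rewrites)
print(fused)
```

In this toy setup, fusion surfaces six distinct documents (higher raw recall), but a document such as `d5`, ranked first by only one rewrite, falls outside a top-2 cutoff. That is one way truncation after fusion can cancel a recall gain, in the spirit of the effect the abstract reports.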
2. 【2603.02137】NextAds: Towards Next-generation Personalized Video Advertising
Link: https://arxiv.org/abs/2603.02137
Authors: Yiyan Xu, Ruoxuan Xia, Wuqiang Zheng, Fengbin Zhu, Wenjie Wang, Fuli Feng
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital advertising landscape, personalized video advertising, online video consumption, video advertising, rapid growth
Comments:
Abstract:With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the optimal one from a small set of professionally pre-produced creatives for each user. Such static and finite inventories limit both the granularity and the timeliness of personalization, and prevent the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time. In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
3. 【2603.02098】OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Link: https://arxiv.org/abs/2603.02098
Authors: Chuong Huynh, Manh Luong, Abhinav Shrivastava
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: retrieve desired targets, desired targets, retrieval, retrieve desired, Multimodal retrieval
Comments: CVPR 2026. Project link: [this https URL](https://github.com/hmchuong/omniret)
Abstract:Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks, composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model's omni-modal embedding capacity.
4. 【2603.01926】MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation
Link: https://arxiv.org/abs/2603.01926
Authors: Xinxin Dong, Haokai Ma, Yuze Zheng, Yongfu Zha, Yonghui Yang, Xiaodong Wang
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Micro-video recommendation aims, capture user preferences, aims to capture, context information, Micro-video recommendation
Comments:
Abstract:Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos, thereby predicting the appropriate videos. This target is often hindered by the inherent noise within multimodal content and unreliable implicit feedback, which weakens the correspondence between behaviors and underlying interests. While conventional works have predominantly approached such scenarios through behavior-augmented modeling and content-centric multimodal analysis, these paradigms can inadvertently give rise to two non-trivial challenges: preference-irrelative video representation extraction and inherent modality conflicts. To address these issues, we propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec), which simultaneously considers temporal correlations during preference modeling from intra- and inter-video perspectives. Specifically, we first propose Temporal-guided Content Diffusion (TCD) to refine video representations under intra-video temporal guidance and personalized collaborative signals to emphasize salient content while suppressing redundancy. To achieve semantically coherent preference modeling, we further design the Noise-unconditional Preference Denoising (NPD) to recover informative user preferences from corrupted states under blind denoising. Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of our MealRec, further uncovering the effective mechanism of our proposed TCD and NPD. The source code and corresponding dataset will be available upon acceptance.
5. 【2603.01791】Semantic Novelty Trajectories in 80,000 Books: A Cross-Corpus Embedding Analysis
Link: https://arxiv.org/abs/2603.01791
Authors: Fred Zimmerman
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: compression progress theory, English-language publishing, apply Schmidhuber compression, Schmidhuber compression progress, centuries of English-language
Comments: 12 pages, 4 figures, 5 tables
Abstract:I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). The principal findings are fourfold. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness -- the ratio of cumulative path length to net displacement in embedding space -- nearly doubles in the modern corpus (+67%). Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3x more common in pre-1920 literature. Fourth, novelty is orthogonal to reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at this https URL.
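The two trajectory measures named in the abstract can be illustrated with a small sketch. Circuitousness follows the definition given above (cumulative path length divided by net displacement). The running-centroid novelty shown here is only one plausible reading, namely the cosine distance between each paragraph embedding and the mean of all preceding ones; the abstract does not spell out the exact formula, so treat that part as an assumption. The 2-D vectors `zigzag` and `straight` are toy stand-ins for sentence-transformer embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def running_centroid_novelty(embeddings):
    """Novelty of paragraph t = cosine distance to the mean of paragraphs 0..t-1.

    Assumed formulation; the first paragraph has no history and gets no score.
    """
    novelty = []
    centroid = list(embeddings[0])
    for t in range(1, len(embeddings)):
        novelty.append(cosine_distance(embeddings[t], centroid))
        # Incremental mean update so the centroid now covers paragraphs 0..t.
        centroid = [c + (x - c) / (t + 1) for c, x in zip(centroid, embeddings[t])]
    return novelty


def circuitousness(embeddings):
    """Cumulative path length divided by net displacement in embedding space."""
    path = sum(math.dist(a, b) for a, b in zip(embeddings, embeddings[1:]))
    net = math.dist(embeddings[0], embeddings[-1])
    return math.inf if net == 0 else path / net


# Toy 2-D "embeddings" (hypothetical data).
zigzag = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
straight = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]

print(running_centroid_novelty(zigzag))
print(circuitousness(zigzag), circuitousness(straight))
```

A trajectory that moves in a straight line has circuitousness 1, and the value grows as the path wanders relative to where it ends up.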
6. 【2603.01710】Legal RAG Bench: an end-to-end benchmark for legal RAG
Link: https://arxiv.org/abs/2603.01710
Authors: Abdur-Rahman Butler, Umar Butler
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Legal RAG Bench, Legal RAG, legal RAG systems, RAG Bench, introduce Legal RAG
Comments: 13 pages, 3 figures, 4 tables
Abstract:We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
7. 【2603.01666】Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
Link: https://arxiv.org/abs/2603.01666
Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Visual Document Retrieval, visually-rich documents requires, Harnessing the full, challenge in Visual, documents requires retrieval
Comments: Under review
Abstract:Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
8. 【2603.01590】IDProxy: Cold-Start CTR Prediction for Ads and Recommendation at Xiaohongshu with Multimodal LLMs
Link: https://arxiv.org/abs/2603.01590
Authors: Yubin Zhang, Haiming Xu, Guillaume Salha-Galvan, Ruiyan Han, Feiyang Xiao, Yanhua Huang, Li Lin, Yang Luo, Yao Hu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: recommendation systems rely, systems rely heavily, Click-through rate, item cold-start settings, cold-start settings
Comments:
Abstract:Click-through rate (CTR) models in advertising and recommendation systems rely heavily on item ID embeddings, which struggle in item cold-start settings. We present IDProxy, a solution that leverages multimodal large language models (MLLMs) to generate proxy embeddings from rich content signals, enabling effective CTR prediction for new items without usage data. These proxies are explicitly aligned with the existing ID embedding space and are optimized end-to-end under CTR objectives together with the ranking model, allowing seamless integration into existing large-scale ranking pipelines. Offline experiments and online A/B tests demonstrate the effectiveness of IDProxy, which has been successfully deployed in both Content Feed and Display Ads features of Xiaohongshu's Explore Feed, serving hundreds of millions of users daily.
9. 【2603.01536】CLEAR: Null-Space Projection for Cross-Modal De-Redundancy in Multimodal Recommendation
Link: https://arxiv.org/abs/2603.01536
Authors: Hao Zhan, Yihui Wang, Yonghui Yang, Danyang Yue, Yu Wang, Pengyang Shao, Fei Shen, Fei Liu, Le Wu
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Keywords: enhancing collaborative filtering, heterogeneous content modalities, incorporating heterogeneous content, Multimodal, paradigm for enhancing
Comments:
Abstract:Multimodal recommendation has emerged as an effective paradigm for enhancing collaborative filtering by incorporating heterogeneous content modalities. Existing multimodal recommenders predominantly focus on reinforcing cross-modal consistency to facilitate multimodal fusion. However, we observe that multimodal representations often exhibit substantial cross-modal redundancy, where dominant shared components overlap across modalities. Such redundancy can limit the effective utilization of complementary information, explaining why incorporating additional modalities does not always yield performance improvements. In this work, we propose CLEAR, a lightweight and plug-and-play cross-modal de-redundancy approach for multimodal recommendation. Rather than enforcing stronger cross-modal alignment, CLEAR explicitly characterizes the redundant shared subspace across modalities by modeling cross-modal covariance between visual and textual representations. By identifying dominant shared directions via singular value decomposition and projecting multimodal features onto the complementary null space, CLEAR reshapes the multimodal representation space by suppressing redundant cross-modal components while preserving modality-specific information. This subspace-level projection implicitly regulates representation learning dynamics, preventing the model from repeatedly amplifying redundant shared semantics during training. Notably, CLEAR can be seamlessly integrated into existing multimodal recommenders without modifying their architectures or training objectives. Extensive experiments on three public benchmark datasets demonstrate that explicitly reducing cross-modal redundancy consistently improves recommendation performance across a wide range of multimodal recommendation models.
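The projection step described above can be written out as a short worked derivation. This is a generic sketch of null-space projection based on the SVD of a cross-covariance matrix, not the paper's exact formulation: the symbols $V$, $T$, $C$, $U_k$, $W_k$, the centering assumption, and the choice of the number of removed directions $k$ are all assumptions introduced here for illustration.

```latex
% Assumed setup: n items with centered visual features V (n x d_v)
% and centered textual features T (n x d_t).
\begin{align*}
C &= \tfrac{1}{n}\, V^{\top} T \in \mathbb{R}^{d_v \times d_t}
  && \text{(cross-modal covariance)} \\
C &= U \Sigma W^{\top}
  && \text{(singular value decomposition)} \\
P_V &= I - U_k U_k^{\top}, \qquad P_T = I - W_k W_k^{\top}
  && \text{(projectors onto the complement of the top-$k$ shared directions)} \\
\tilde{V} &= V P_V, \qquad \tilde{T} = T P_T
  && \text{(de-redundant features)} \\
\tfrac{1}{n}\, \tilde{V}^{\top} \tilde{T} &= P_V\, C\, P_T
  = \sum_{i > k} \sigma_i\, u_i w_i^{\top}
  && \text{(only the weaker shared components remain)}
\end{align*}
```

The last line shows why such a projection suppresses redundancy: the $k$ largest singular components of the cross-covariance vanish after projection, while directions orthogonal to $U_k$ and $W_k$, which carry modality-specific variation, are left unchanged.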
10. 【2603.01493】PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
Link: https://arxiv.org/abs/2603.01493
Authors: Tianyi Xu, Rong Shan, Junjie Wu, Jiadeng Huang, Teng Wang, Jiachen Zhu, Wenteng Chen, Minxin Tu, Quantao Dou, Zhaoxiang Wang, Changwang Zhang, Weinan Zhang, Jun Wang, Jianghao Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: ecological archives defined, photo retrieval non-trivial, Personal photo albums, ecological archives, personalized photo retrieval
Comments: Under review
Abstract:Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems perform poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
11. 【2603.01471】Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
Link: https://arxiv.org/abs/2603.01471
Authors: Jiahan Chen, Da Li, Hengran Zhang, Yinqiong Cai, Lixin Su, Jiafeng Guo, Daiting Shi, Dawei Yin, Keping Bi
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: multimodal large language, large language models, yielded significant performance, significant performance improvements, Multimodal embedding
Comments:
Abstract:Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding EOS embeddings. This drives the multimodal model to compress the semantic information of the input into the EOS token, laying the foundations for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. Results validate that content reconstruction serves as an effective strategy to maximize the value of existing data, enabling multimodal embedding models to generate compact and informative representations, raising their performance ceiling.
12. 【2603.01455】From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Link: https://arxiv.org/abs/2603.01455
Authors: Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Keywords: impressive short-term reasoning, human cognitive efficiency, large language models, demonstrated impressive short-term, long-horizon video understanding
Comments: TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding grounded in Fuzzy-Trace Theory. It features adaptive memory compression via the Information Bottleneck and employs an entropy-driven top-down retrieval to access fine-grained details only when necessary. 16 pages, 7 figures, 7 tables
Abstract:While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at this https URL.
13. 【2603.01425】LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval
Link: https://arxiv.org/abs/2603.01425
Authors: Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: fundamentally transformed dense, generative architectures, transformed dense retrieval, fundamentally transformed, discriminative encoders
Comments: Under Review
Abstract:LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
14. 【2603.01417】ReFeed: Retrieval Feedback-Guided Dataset Construction for Style-Aware Query Rewriting
Link: https://arxiv.org/abs/2603.01417
Authors: Jiyoon Myung, Jungki Son, Kyungro Lee, Jihyeon Park, Joohyung Han
Categories: Information Retrieval (cs.IR)
Keywords: queries differ stylistically, user queries differ, differ stylistically, user queries, reformulating user queries
Comments: Accepted at the Workshop on New Frontiers in Information Retrieval (AAAI 2026)
Abstract:Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. Query rewriting has been proposed to bridge this gap, improving retrieval by reformulating user queries into semantically equivalent forms. However, most existing approaches overlook the stylistic characteristics of target documents (their domain-specific phrasing, tone, and structure), which are crucial for matching real-world data distributions. We introduce a retrieval feedback-driven dataset generation framework that automatically identifies failed retrieval cases, leverages large language models to rewrite queries in the style of relevant documents, and verifies improvement through re-retrieval. The resulting corpus of (original, rewritten) query pairs enables the training of rewriter models that are explicitly aware of document style and retrieval feedback. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems in real-world, domain-specific contexts.
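The three-step loop described in this abstract (detect a failed retrieval, rewrite the query in the style of the relevant document, keep the pair only if re-retrieval now succeeds) can be sketched as follows. Everything concrete here is hypothetical: `CORPUS`, `STYLE_MAP`, the token-overlap `toy_retrieve`, and the lookup-based `toy_rewrite` merely stand in for a real retriever and an LLM rewriter, which the abstract does not specify.

```python
# Hypothetical two-document corpus written in "domain style".
CORPUS = {
    "doc_refund": "reimbursement policy for returned merchandise",
    "doc_ship": "shipment tracking and delivery schedule",
}

# Stand-in for an LLM: maps casual words to domain vocabulary.
STYLE_MAP = {"money": "reimbursement", "back": "returned", "stuff": "merchandise"}


def toy_retrieve(query):
    """Rank documents by token overlap with the query; drop zero-overlap docs."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


def toy_rewrite(query, doc_text):
    """Swap in a domain term only when the relevant document actually uses it."""
    doc_tokens = set(doc_text.split())
    out = []
    for tok in query.lower().split():
        candidate = STYLE_MAP.get(tok, tok)
        out.append(candidate if candidate in doc_tokens else tok)
    return " ".join(out)


def build_rewrite_pairs(queries, retrieve, rewrite, top_k=3):
    """Collect (original, rewritten) pairs whose rewrite fixes a failed retrieval."""
    pairs = []
    for query, gold_id, gold_text in queries:
        if gold_id in retrieve(query)[:top_k]:
            continue  # retrieval already succeeds, nothing to learn from
        rewritten = rewrite(query, gold_text)
        if gold_id in retrieve(rewritten)[:top_k]:
            pairs.append((query, rewritten))  # verified by re-retrieval
    return pairs


queries = [
    ("money back on stuff", "doc_refund", CORPUS["doc_refund"]),
    ("delivery schedule", "doc_ship", CORPUS["doc_ship"]),
]
pairs = build_rewrite_pairs(queries, toy_retrieve, toy_rewrite)
print(pairs)
```

Only the first query yields a training pair: it initially retrieves nothing, and its rewrite retrieves the gold document. The second query already succeeds and is skipped, so the resulting dataset contains only rewrites whose benefit was confirmed by retrieval feedback.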
15. 【2603.01241】ARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents
Link: https://arxiv.org/abs/2603.01241
Authors: Junda Wang, Zonghai Tao, Hansi Zeng, Zhichao Yang, Hamed Zamani, Hong Yu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Complex clinical decision, model lacks facts, Complex clinical, clinical decision making, lacks facts
Comments:
Abstract:Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
16. 【2603.01082】Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
Link: https://arxiv.org/abs/2603.01082
Authors: Xuan Lu, Kangle Li, Haohang Huang, Rui Meng, Wenjun Zeng, Xiaoyu Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Recent advances, multimodal large language, large language models, enabling systems, large language
Comments: Accepted by CVPR 2026
Abstract:Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at this https URL
17. 【2603.00980】Beyond the Flat Sequence: Hierarchical and Preference-Aware Generative Recommendations
Link: https://arxiv.org/abs/2603.00980
Authors: Zerui Chen, Heng Chang, Tianying Liu, Chuantian Zhou, Yi Cao, Jiandong Ding, Ming Liu, Bing Qin
Categories: Information Retrieval (cs.IR)
Keywords: Sequential Transduction Unit, Hierarchical Sequential Transduction, Transduction Unit, Sequential Transduction, user interaction sequences
Comments: Accepted to the ACM Web Conference 2026 (WWW '26). 9 pages, 9 figures. Zerui Chen and Heng Chang contributed equally to this work
Abstract:Generative Recommenders (GRs), exemplified by the Hierarchical Sequential Transduction Unit (HSTU), have emerged as a powerful paradigm for modeling long user interaction sequences. However, we observe that their "flat-sequence" assumption overlooks the rich, intrinsic structure of user behavior. This leads to two key limitations: a failure to capture the temporal hierarchy of session-based engagement, and computational inefficiency, as dense attention introduces significant noise that obscures true preference signals within semantically sparse histories, which deteriorates the quality of the learned representations. To this end, we propose a novel framework named HPGR (Hierarchical and Preference-aware Generative Recommender), built upon a two-stage paradigm that injects these crucial structural priors into the model to handle the drawback. Specifically, HPGR comprises two synergistic stages. First, a structure-aware pre-training stage employs a session-based Masked Item Modeling (MIM) objective to learn a hierarchically-informed and semantically rich item representation space. Second, a preference-aware fine-tuning stage leverages these powerful representations to implement a Preference-Guided Sparse Attention mechanism, which dynamically constrains computation to only the most relevant historical items, enhancing both efficiency and signal-to-noise ratio. Empirical experiments on a large-scale proprietary industrial dataset from APPGallery and an online A/B test verify that HPGR achieves state-of-the-art performance over multiple strong baselines, including HSTU and MTGR.
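The idea of restricting attention to the most relevant historical items can be illustrated with a minimal top-k sketch. This is not the authors' Preference-Guided Sparse Attention: the function name `topk_sparse_attention`, the dot-product relevance score, and the parameter `k` are all assumptions made for illustration.

```python
import math

def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k history items with the highest dot-product score.

    Hypothetical illustration; the relevance score used by HPGR is not
    specified in the abstract.
    """
    k = max(1, min(k, len(keys)))
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k most relevant items (ties keep their original order).
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected items.
    m = max(scores[i] for i in top)
    weights = {i: math.exp(scores[i] - m) for i in top}
    total = sum(weights.values())
    out = [0.0] * len(values[0])
    for i in top:
        w = weights[i] / total
        for d, v in enumerate(values[i]):
            out[d] += w * v
    return out, sorted(top)
```

With `k` much smaller than the history length, the weighted sum touches only `k` value vectors, which is where the efficiency and noise-reduction argument in the abstract comes from.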
18. 【2603.00854】GeMi: A Graph-based, Multimodal Recommendation System for Narrative Scroll Paintings
Link: https://arxiv.org/abs/2603.00854
Authors: Haimonti Dutta, Pruthvi Moluguri, Jin Dai, Saurabh Amarnath Mahindre
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: users discover interesting, recommendation system, Recommendation, Recommendation Systems, effective in managing
Abstract:Recommendation Systems are effective in managing the ever-increasing amount of multimodal data available today and help users discover interesting new items. These systems can handle various media types such as images, text, audio, and video data, and this has made it possible to handle content-based recommendation utilizing features extracted from items while also incorporating user preferences. Graph Neural Network (GNN)-based recommendation systems are a special class of recommendation systems that can handle relationships between items and users, making them particularly attractive for content-based recommendations. Their popularity also stems from the fact that they use advanced machine learning techniques, such as deep learning on graph-structured data, to exploit user-to-item interactions. The nodes in the graph can access higher-order neighbor information along with state-of-the-art vision-language models for processing multimodal content, and there are well-designed algorithms for embedding, message passing, and propagation. In this work, we present the design of a GNN-based recommendation system on a novel data set collected from field research. Designed for an endangered performing art form, the recommendation system uses multimodal content (text and image data) to suggest similar paintings for viewing and purchase. To the best of our knowledge, there is no recommendation system designed for narrative scroll paintings -- our work therefore serves several purposes, including art conservation, a data storage system for endangered art objects, and a state-of-the-art recommendation system that leverages both the novel characteristics of the data and preferences of the user population interested in narrative scroll paintings.
19. 【2603.00846】Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
Link: https://arxiv.org/abs/2603.00846
Authors: Yichao Wu, Penghao Liang, Yafei Xiang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: grounds Large Language, mitigate factual hallucinations, Retrieval-Augmented Generation, Large Language Models, grounds Large
Abstract:Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.
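Constrained decoding for binary routing can be sketched as picking the best-scoring label from a fixed allowed set and ignoring every other token. The label names `PASS` and `FALLBACK`, the function name, and the dictionary-of-logits input are hypothetical. A real system would apply the mask to the tokenizer's vocabulary logits, and the abstract does not describe how Tiny-Critic does this.

```python
def constrained_binary_decision(logits, allowed=("PASS", "FALLBACK")):
    """Return the highest-scoring token among the allowed labels only.

    `logits` maps token strings to scores (a stand-in for model output).
    """
    masked = {tok: logits[tok] for tok in allowed if tok in logits}
    if not masked:
        raise ValueError("none of the allowed labels has a logit")
    # Tokens outside `allowed` can never be chosen, however high they score.
    return max(masked, key=masked.get)
```

Because the output space has exactly two members, the gate needs a single forward pass and no free-form generation.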
20. 【2603.00801】The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
Link: https://arxiv.org/abs/2603.00801
Authors: Shrey Shah, Levent Ozgur
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Language agents increasingly, agents increasingly act, Language agents, increasingly act, act as web-enabled
Comments: Submitted to ICML 2026, currently under review
Abstract:Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources. However, these sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking - where misleading information appears prominently in search results - remains poorly understood. Existing benchmarks evaluate functional navigation or static factuality but cannot causally isolate this vulnerability, and current mitigation strategies for retrieval-augmented generation remain largely untested under such conditions. We introduce Synthetic Web Benchmark, a procedurally generated environment comprising thousands of hyperlinked articles with ground-truth labels for credibility and factuality, process-level interaction traces, and contamination filtering to eliminate training-data leakage. By injecting a single high-plausibility misinformation article into a controllable search rank, we measure the causal effect of adversarial exposure in six frontier models. The results reveal catastrophic failures: accuracy collapses despite unlimited access to truthful sources, with minimal search escalation and severe miscalibration. These findings expose fundamental limitations in how current frontier models handle conflicting information, with immediate implications for deployment in high-stakes domains. Our benchmark enables systematic analysis of these failure modes and provides a controlled testbed for evaluating mitigation strategies under adversarial ranking - a gap in current research. This work establishes a reproducible baseline for developing search-robust and epistemically humble agents capable of resisting manipulation in high-stakes domains.
21. 【2603.00700】SODA: Semantic-Oriented Distributional Alignment for Generative Recommendation
Link: https://arxiv.org/abs/2603.00700
Authors: Ziqi Xue, Dingxian Wang, Yimeng Bai, Shuai Zhu, Jialei Li, Xiaoyan Zhao, Frank Yang, Andrew Rabinovich, Yang Zhang, Pablo N. Mendes
Categories: Information Retrieval (cs.IR)
Keywords: compact token space, alternative to traditional, pipelines by operating, token space, recommendation has emerged
Abstract:Generative recommendation has emerged as a scalable alternative to traditional retrieve-and-rank pipelines by operating in a compact token space. However, existing methods mainly rely on discrete code-level supervision, which leads to information loss and limits the joint optimization between the tokenizer and the generative recommender. In this work, we propose a distribution-level supervision paradigm that leverages probability distributions over multi-layer codebooks as soft and information-rich representations. Building on this idea, we introduce Semantic-Oriented Distributional Alignment (SODA), a plug-and-play contrastive supervision framework based on Bayesian Personalized Ranking, which aligns semantically rich distributions via negative KL divergence while enabling end-to-end differentiable training. Extensive experiments on multiple real-world datasets demonstrate that SODA consistently improves the performance of various generative recommender backbones, validating its effectiveness and generality. Codes will be available upon acceptance.
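One way to read "BPR with negative KL divergence" is to score an item by the negative KL divergence between a predicted codebook distribution and the item's distribution, then apply the usual pairwise BPR loss. The sketch below follows that reading. The KL direction, the sum over codebook layers, and all function names are assumptions, not details confirmed by the abstract.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same codebook."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def neg_kl_score(pred_layers, item_layers):
    """Higher is better: negative KL summed over codebook layers (assumed)."""
    return -sum(kl_divergence(p, q) for p, q in zip(pred_layers, item_layers))

def bpr_loss(pred_layers, pos_layers, neg_layers):
    """-log(sigmoid(score_pos - score_neg)) in a numerically stable form."""
    diff = neg_kl_score(pred_layers, pos_layers) - neg_kl_score(pred_layers, neg_layers)
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

Since every step is a smooth function of the distributions, the same computation written in an autodiff framework would be differentiable end to end, which matches the property the abstract emphasizes.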
22. 【2603.00638】RAIE: Region-Aware Incremental Preference Editing with LoRA for LLM-based Recommendation
Link: https://arxiv.org/abs/2603.00638
Authors: Jin Zeng, Yupeng Qi, Hui Li, Chengming Li, Ziyu Lyu, Lixin Cui, Lu Bai
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large language models, Large language, recommender systems, increasingly adopted, Large
Abstract:Large language models (LLMs) are increasingly adopted as the backbone of recommender systems. However, user-item interactions in real-world scenarios are non-stationary, making preference drift over time inevitable. Existing model update strategies mainly rely on global fine-tuning or pointwise editing, but they face two fundamental challenges: (i) imbalanced update granularity, where global updates perturb behaviors unrelated to the target while pointwise edits fail to capture broader preference shifts; (ii) unstable incremental updates, where repeated edits interfere with prior adaptations, leading to catastrophic forgetting and inconsistent recommendations. To address these issues, we propose Region-Aware Incremental Editing (RAIE), a plug-in framework that freezes the backbone model and performs region-level updates. RAIE first constructs semantically coherent preference regions via spherical k-means in the representation space. It then assigns incoming sequences to regions via confidence-aware gating and performs three localized edit operations - Update, Expand, and Add - to dynamically revise the affected region. Each region is equipped with a dedicated Low-Rank Adaptation (LoRA) module, which is trained only on the region's updated data. During inference, RAIE routes each user sequence to its corresponding region and activates the region-specific adapter for prediction. Experiments on two benchmark datasets under a time-sliced protocol that segments data into Set-up (S), Finetune (F), and Test (T) show that RAIE significantly outperforms state-of-the-art baselines while effectively mitigating forgetting. These results demonstrate that region-aware editing offers an accurate and scalable mechanism for continual adaptation in dynamic recommendation scenarios. Our code is available at this https URL.
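The region construction and routing steps can be sketched with a plain spherical k-means (cosine-based assignment, centroids re-projected onto the unit sphere) followed by a thresholded router. The threshold value, the function names, and the use of `None` for low-confidence sequences are assumptions. The abstract does not say how RAIE chooses among Update, Expand, and Add.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def spherical_kmeans(points, centroids, iters=10):
    """Cluster by cosine similarity; centroids stay on the unit sphere."""
    centroids = [normalize(c) for c in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            best = max(range(len(centroids)), key=lambda j: cosine(p, centroids[j]))
            groups[best].append(normalize(p))
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a region received no points
                centroids[j] = normalize([sum(col) / len(g) for col in zip(*g)])
    return centroids

def route(seq_embedding, centroids, threshold=0.8):
    """Return the best region index, or None when confidence is too low."""
    sims = [cosine(seq_embedding, c) for c in centroids]
    best = max(range(len(sims)), key=lambda j: sims[j])
    return best if sims[best] >= threshold else None
```

In a full system the returned index would select which region-specific LoRA adapter to activate. That part is omitted here.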
23. 【2603.00632】Stop Treating Collisions Equally: Qualification-Aware Semantic ID Learning for Recommendation at Industrial Scale
Link: https://arxiv.org/abs/2603.00632
Authors: Zheng Hu, Yuxin Chen, Yongsen Pan, Xu Yuan, Yuting Yin, Daoyuan Wang, Boyang Xia, Zefei Luo, Hongyang Wang, Songhao Ni, Dongxu Liang, Jun Wang, Shimin Cai, Tao Zhou, Fuji Ren, Wenwu Ou
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: compact discrete representations, discrete representations derived, multimodal item features, generative recommendation, compact discrete
Abstract:Semantic IDs (SIDs) are compact discrete representations derived from multimodal item features, serving as a unified abstraction for ID-based and generative recommendation. However, learning high-quality SIDs remains challenging due to two issues. (1) Collision problem: the quantized token space is prone to collisions, in which semantically distinct items are assigned identical or overly similar SID compositions, resulting in semantic entanglement. (2) Collision-signal heterogeneity: collisions are not uniformly harmful. Some reflect genuine conflicts between semantically unrelated items, while others stem from benign redundancy or systematic data effects. To address these challenges, we propose Qualification-Aware Semantic ID Learning (QuaSID), an end-to-end framework that learns collision-qualified SIDs by selectively repelling qualified conflict pairs and scaling the repulsion strength by collision severity. QuaSID consists of two mechanisms: Hamming-guided Margin Repulsion, which translates low-Hamming SID overlaps into explicit, severity-scaled geometric constraints on the encoder space; and Conflict-Aware Valid Pair Masking, which masks protocol-induced benign overlaps to denoise repulsion supervision. In addition, QuaSID incorporates a dual-tower contrastive objective to inject collaborative signals into tokenization. Experiments on public benchmarks and industrial data validate QuaSID. On public datasets, QuaSID consistently outperforms strong baselines, improving top-K ranking quality by 5.9% over the best baseline while increasing SID composition diversity. In an online A/B test on Kuaishou e-commerce with a 5% traffic split, QuaSID increases ranking GMV-S2 by 2.38% and improves completed orders on cold-start retrieval by up to 6.42%. Finally, we show that the proposed repulsion loss is plug-and-play and enhances a range of SID learning frameworks across datasets.
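A severity-scaled repulsion term can be sketched as a hinge loss whose margin grows as two semantic IDs overlap more, that is, as their Hamming distance shrinks. The linear severity formula, the Euclidean embedding distance, and the names below are assumptions. The sketch also leaves out the Conflict-Aware Valid Pair Masking described in the abstract.

```python
import math

def hamming(sid_a, sid_b):
    """Number of positions where two equal-length semantic IDs differ."""
    return sum(a != b for a, b in zip(sid_a, sid_b))

def margin_repulsion(emb_a, emb_b, sid_a, sid_b, base_margin=1.0):
    """Hinge penalty with a margin scaled by collision severity (assumed form)."""
    severity = 1.0 - hamming(sid_a, sid_b) / len(sid_a)  # 1.0 means identical SIDs
    margin = base_margin * severity
    # Penalize only pairs whose encoder embeddings are closer than the margin.
    return max(0.0, margin - math.dist(emb_a, emb_b))
```

Pairs with fully distinct SIDs get a zero margin and therefore no penalty, while identical SIDs receive the full `base_margin`.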
24. 【2603.00434】RTLocating: Intent-aware RTL Localization for Hardware Design Iteration
Link: https://arxiv.org/abs/2603.00434
Authors: Changwen Xing, Yanfeng Lu, Lei Qi, Chenxu Niu, Jie Li, Xi Wang, Yong Chen, Jun Yang
Categories: Emerging Technologies (cs.ET); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Industrial chip development, Industrial chip, favoring localized, Register Transfer Level, inherently iterative
Abstract:Industrial chip development is inherently iterative, favoring localized, intent-driven updates over rewriting RTL from scratch. Yet most LLM-Aided Hardware Design (LAD) work focuses on one-shot synthesis, leaving this workflow underexplored. To bridge this gap, we for the first time formalize $\Delta$Spec-to-RTL localization, a multi-positive problem mapping natural language change requests ($\Delta$Spec) to the affected Register Transfer Level (RTL) syntactic blocks. We propose RTLocating, an intent-aware RTL localization framework, featuring a dynamic router that adaptively fuses complementary views from a textual semantic encoder, a local structural encoder, and a global interaction and dependency encoder (GLIDE). To enable scalable supervision, we introduce EvoRTL-Bench, the first industrial-scale benchmark for intent-code alignment derived from OpenTitan's Git history, comprising 1,905 validated requests and 13,583 $\Delta$Spec-RTL block pairs. On EvoRTL-Bench, RTLocating achieves 0.568 MRR and 15.08% R@1, outperforming the strongest baseline by +22.9% and +67.0%, respectively, establishing a new state-of-the-art for intent-driven localization in evolving hardware designs.
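For readers comparing the two reported metrics, the sketch below shows one common way to compute MRR and recall at k when each request has several correct blocks. The exact definitions used by EvoRTL-Bench are not given in the abstract, so treat these as standard textbook forms. The function names and the toy data in the usage note are hypothetical.

```python
def mrr(rankings, positives):
    """Mean reciprocal rank of the first relevant item per query."""
    total = 0.0
    for ranked, pos in zip(rankings, positives):
        for rank, item in enumerate(ranked, start=1):
            if item in pos:
                total += 1.0 / rank
                break
    return total / len(rankings)

def recall_at_k(rankings, positives, k):
    """Average fraction of each query's positives found in the top k."""
    total = 0.0
    for ranked, pos in zip(rankings, positives):
        hits = sum(1 for item in ranked[:k] if item in pos)
        total += hits / len(pos)
    return total / len(rankings)
```

For example, with rankings `[["a", "b", "c"], ["x", "y", "z"]]` and positives `[{"b", "c"}, {"x"}]`, MRR is 0.75 while recall at 1 is 0.5. Under this definition a query with many positives can never reach high recall at 1, which is one possible reason the two numbers diverge in multi-positive settings.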
25. 【2603.00416】MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation
Link: https://arxiv.org/abs/2603.00416
Authors: Rong Shan, Aofan Yu, Bo Chen, Kuo Cai, Qiang Luo, Ruiming Tang, Han Li, Weiwen Liu, Weinan Zhang, Jianghao Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: increasingly emphasizing scaling, leveraging larger architectures, emphasizing scaling, leveraging larger, improve personalization
Comments: Under Review
Abstract:Recommender systems (RecSys) are increasingly emphasizing scaling, leveraging larger architectures and more interaction data to improve personalization. Yet, despite the optimizer's pivotal role in training, modern RecSys pipelines almost universally default to Adam/AdamW, with limited scrutiny of whether these choices are truly optimal for recommendation. In this work, we revisit optimizer design for scalable recommendation and introduce MuonRec, the first framework that brings the recently proposed Muon optimizer to RecSys training. Muon performs orthogonalized momentum updates for 2D weight matrices via Newton-Schulz iteration, promoting diverse update directions and improving optimization efficiency. We develop an open-source training recipe for recommendation models and evaluate it across both traditional sequential recommenders and modern generative recommenders. Extensive experiments demonstrate that MuonRec reduces converged training steps by an average of 32.4% while simultaneously improving final ranking quality. Specifically, MuonRec yields consistent relative gains in NDCG@10, averaging 12.6% across all settings, with particularly pronounced improvements in generative recommendation models. These results consistently outperform strong Adam/AdamW baselines, positioning Muon as a promising new optimizer standard for RecSys training. Our code is available.
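The orthogonalization step can be illustrated with the classic cubic Newton-Schulz iteration, which drives every singular value of a matrix toward 1 without computing an SVD. This is a textbook sketch under stated assumptions, not the MuonRec recipe: Muon implementations may use a different polynomial and apply the step to a momentum buffer, and the helper names here are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of a full-rank square matrix g."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    fro = sum(x * x for row in g for x in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x
```

The result keeps the singular vectors of the input but flattens its singular values, so no single direction dominates the update. This is the "diverse update directions" property mentioned in the abstract.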
26. 【2603.00270】Transformers Remember First, Forget Last: Dual-Process Interference in LLMs
Link: https://arxiv.org/abs/2603.00270
Authors: Sourav Chattaraj, Kanak Raj
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: encounter conflicting information, large language models, language models encounter, models encounter conflicting, memories survive
Comments: 16 pages, 10 figures. Under review
Abstract:When large language models encounter conflicting information in context, which memories survive -- early or recent? We adapt classical interference paradigms from cognitive psychology to answer this question, testing 39 LLMs across diverse architectures and scales. Every model shows the same pattern: proactive interference (PI) dominates retroactive interference (RI) universally (Cohen's d = 1.73, p < 0.0001), meaning early encodings are protected at the cost of recent information -- the opposite of human memory, where RI typically dominates. Three findings indicate that RI and PI reflect separate memory mechanisms. RI and PI are uncorrelated (R^2 = 0.044), rejecting a unified "memory capacity." Model size predicts RI resistance (R^2 = 0.49) but not PI (R^2 = 0.06, n.s.) -- only RI is capacity-dependent. And error analysis reveals distinct failure modes: RI failures are passive retrieval failures (51%), while PI failures show active primacy intrusion (56%); both show 1% hallucination. These patterns parallel the consolidation-retrieval distinction in cognitive science, suggesting that transformer attention creates a primacy bias with direct implications for interference-heavy applications.
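As a reminder of what the reported effect size measures, Cohen's d is the difference between two group means divided by a pooled standard deviation. The sketch below uses the independent-samples form. Whether the paper uses this form or a paired variant is not stated in the abstract, and the sample values in the usage note are made up.

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (independent samples)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```

For instance, `cohens_d([2, 4, 6], [1, 3, 5])` gives 0.5: the means differ by 1 and the pooled standard deviation is 2. By the usual convention, values above 0.8 are considered large.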
27. 【2603.00267】Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking
Link: https://arxiv.org/abs/2603.00267
Authors: Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, Zhuohan Xie
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Keywords: Misinformation spreading, Internet poses, Retrieval Augmented Generation, societies and individuals, necessitating robust
Abstract:Misinformation spreading over the Internet poses a significant threat to both societies and individuals, necessitating robust and scalable fact-checking that relies on retrieving accurate and trustworthy evidence. Previous methods rely on semantic and social-contextual patterns learned from training data, which limits their generalization to new data distributions. Recently, Retrieval Augmented Generation (RAG) based methods have been proposed to utilize the reasoning capability of LLMs with retrieved grounding evidence documents. However, these methods largely rely on textual similarity for evidence retrieval and struggle to retrieve evidence that captures multi-hop semantic relations within rich document contents. These limitations lead to overlooking subtle factual correlations between the evidence and the claims to be fact-checked during evidence retrieval, thus causing inaccurate veracity predictions. To address these issues, we propose WKGFC, which exploits an authorized open knowledge graph as a core resource of evidence. LLM-enabled retrieval is designed to assess the claims and retrieve the most relevant knowledge subgraphs, forming structured evidence for fact verification. To augment the knowledge graph evidence, we retrieve web contents for completion. The above process is implemented as an automatic Markov Decision Process (MDP): a reasoning LLM agent decides what actions to take according to the current evidence and the claims. To adapt the MDP for fact-checking, we use prompt optimization to fine-tune the agentic LLM.
28. 【2603.00155】EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
Link: https://arxiv.org/abs/2603.00155
Authors: Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: visually coherent presentations, distill lengthy research, lengthy research papers, Multimodal Large Language, Large Language Models
Abstract:Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at this https URL.
29. 【2603.00147】Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents
Link: https://arxiv.org/abs/2603.00147
Authors: Carlos Monroy, Benjamin Navarro
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Keywords: established computational techniques, established computational, broader discipline, Image segmentation, image processing
Comments: 6 pages, 7 figures
Abstract:Image segmentation and image recognition are well-established computational techniques in the broader discipline of image processing. Segmentation locates areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, the degree of specialization in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology ontoShip and glossary glosShip of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of marrying these technologies for improving curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.
30. 【2603.00126】QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
Link: https://arxiv.org/abs/2603.00126
Authors: Miao Zhang, Ruixiao Zhang, Jianxin Shi, Hengzhi Wang, Hao Fang, Jiangchuan Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM); Performance (cs.PF); Systems and Control (eess.SY)
Keywords: bringing unified solutions, Video-language models, bringing unified, reasoning tasks, unified solutions
Abstract:Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
31. 【2603.00122】NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
Link: https://arxiv.org/abs/2603.00122
Authors: Aman Ulla
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: retrieval-augmented generation, important step, step before retrieval-augmented, downstream generative, Document extraction
Comments: 17 pages, 10 figures, 5 tables
Abstract:Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. When a page image is sent in, the first thing that happens is that it goes through both models at the same time. The element model finds semantic content like the title, header, text, table, image, and so on, and the layout model finds structural regions like layout_box, column_group, multi_column, row_group, and so on. A key design decision is to first send an image or figure through an image classifier (ViT) that decides whether it is relevant or not. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and costs. NovaLAD is built for speed: it works on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several forms, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
32. 【2603.00084】DeepXiv-SDK: An Agentic Data Interface for Scientific Papers
Link: https://arxiv.org/abs/2603.00084
Authors: Hongjin Qian, Ziyi Xia, Ze Liu, Jianlv Chen, Kun Luo, Minghao Qin, Chaofan Li, Lei Xiong, Sen Wang, Zhengyang Liang, Zheng Liu
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: data, LLM-agents are increasingly, accelerate the progress, data access, agentic data interface
Comments: Project at [this https URL](https://github.com/DeepXiv/deepxiv_sdk)
Abstract:LLM-agents are increasingly used to accelerate the progress of scientific research. Yet a persistent bottleneck is data access: agents not only lack readily available tools for retrieval, but also have to work with unstructured, human-centric data on the Internet, such as HTML web-pages and PDF files, leading to excessive token consumption, limited working efficiency, and brittle evidence look-up. This gap motivates the development of an agentic data interface, which is designed to enable agents to access and utilize scientific literature in a more effective, efficient, and cost-aware manner. In this paper, we introduce DeepXiv-SDK, which offers a three-layer agentic data interface for scientific literature. 1) Data Layer, which transforms unstructured, human-centric data into normalized and structured representations in JSON format, improving data usability and enabling progressive accessibility of the data. 2) Service Layer, which presents readily available tools for data access and ad-hoc retrieval. It also enables a rich form of agent usage, including CLI, MCP, and Python SDK. 3) Application Layer, which creates a built-in agent, packaging basic tools from the service layer to support complex data access demands. DeepXiv-SDK currently supports the complete arXiv corpus, and is synchronized daily to incorporate new releases. It is designed to extend to all common open-access corpora, such as PubMed Central, bioRxiv, medRxiv, and chemRxiv. We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows. DeepXiv-SDK is free to use with registration.
33. 【2603.00026】ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents
Link: https://arxiv.org/abs/2603.00026
Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Effective memory management, large language model, handling long-term interactions, Effective memory, language model
Comments:
Abstract:Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.
34. 【2603.00022】Noise reduction in BERT NER models for clinical entity extraction
Link: https://arxiv.org/abs/2603.00022
Authors: Kuldeep Jiwani, Yash K Jeengar, Ayush Dhaka
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Named Entity Recognition, clinical entity extraction, NER, notes and reports, utmost importance
Comments:
Abstract:Precision is of utmost importance in the realm of clinical entity extraction from clinical notes and reports. Encoder models fine-tuned for Named Entity Recognition (NER) are an efficient choice for this purpose, as they don't hallucinate. We pre-trained an in-house BERT over clinical data and then fine-tuned it for NER. These models performed well on recall but could not reach the high precision range needed for clinical models. To address this challenge, we developed a Noise Removal model that refines the output of NER. The NER model assigns token-level entity tags along with probability scores for each token. Our Noise Removal (NR) model then analyzes these probability sequences and classifies predictions as either weak or strong. A naïve approach might involve filtering predictions based on low probability values; however, this method is unreliable. Owing to the characteristics of the SoftMax function, Transformer-based architectures often assign disproportionately high confidence scores even to uncertain or weak predictions, making simple thresholding ineffective. To address this issue, we adopted a supervised modeling strategy in which the NR model leverages advanced features such as the Probability Density Map (PDM). The PDM captures the Semantic-Pull effect observed within Transformer embeddings, an effect that manifests in the probability distributions of NER class predictions across token sequences. This approach enables the model to classify predictions as weak or strong with significantly improved accuracy. With these NR models we were able to reduce False Positives across various clinical NER models by 50% to 90%.
35. 【2603.00097】Exploring Drug Safety Through Knowledge Graphs: Protein Kinase Inhibitors as a Case Study
Link: https://arxiv.org/abs/2603.00097
Authors: David Jackson, Michael Gertz, Jürgen Hesser
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Adverse Drug Reactions, Adverse Drug, Drug Reactions, morbidity and mortality, Adverse
Comments: 14 pages, 5 figures. Code and data available at [this https URL](https://github.com/davidjackson99/PKI_KG)
Abstract:Adverse Drug Reactions (ADRs) are a leading cause of morbidity and mortality. Existing prediction methods rely mainly on chemical similarity, machine learning on structured databases, or isolated target profiles, but often fail to integrate heterogeneous, partly unstructured evidence effectively. We present a knowledge graph-based framework that unifies diverse sources, drug-target data (ChEMBL), clinical trial literature (PubMed), trial metadata (this http URL), and post-marketing safety reports (FAERS) into a single evidence-weighted bipartite network of drugs and medical conditions. Applied to 400 protein kinase inhibitors, the resulting network enables contextual comparison of efficacy (HR, PFS, OS), phenotypic and target similarity, and ADR prediction via target-to-adverse-event correlations. A non-small cell lung cancer case study correctly highlights established and candidate drugs, target communities (ERbB, ALK, VEGF), and tolerability differences. Designed as an orthogonal, extensible analysis and search tool rather than a replacement for current models, the framework excels at revealing complex patterns, supporting hypothesis generation, and enhancing pharmacovigilance. Code and data are publicly available at this https URL.
Computer Vision
1. 【2603.02210】HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Link: https://arxiv.org/abs/2603.02210
Authors: Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: play a vital, role in advertising, digital marketing, showcase the integration, integration of humans
Comments: Accepted by CVPR 2026 (Project page: [this https URL](https://correr-zhou.github.io/HiFi-Inpaint/))
Abstract:Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
2. 【2603.02200】Adaptive Confidence Regularization for Multimodal Failure Detection
Link: https://arxiv.org/abs/2603.02200
Authors: Moru Liu, Hao Dong, Olga Fink, Mario Trapp
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: strong predictive performance, high-stakes domains, medical diagnostics, Adaptive Confidence Regularization, models in high-stakes
Comments: Accepted by CVPR 2026
Abstract:The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at this https URL.
3. 【2603.02194】From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Link: https://arxiv.org/abs/2603.02194
Authors: Mateus Karvat, Bram Adams, Sidney Givigi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Software Engineering (cs.SE)
Keywords: Autonomous vehicle, typically evaluated solely, benchmark performance metrics, solely on benchmark, limited attention
Comments:
Abstract:Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
4. 【2603.02190】Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Link: https://arxiv.org/abs/2603.02190
Authors: Divyanshu Daiya, Aniket Bera
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: sketches into coherent, turns storyboard-style, control over agents, fine-grained control, multi-human motion
Comments: Accepted to CVPR 2026 Main Conference (11 pages, 5 figures)
Abstract:We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
5. 【2603.02181】Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
Link: https://arxiv.org/abs/2603.02181
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Intangible Cultural Heritage, Mekong Delta poses, Delta poses unique, Cultural Heritage, Intangible Cultural
Comments: Early accept of Vol 2025 No 3, November: Journal on Information Technologies Communications
Abstract:The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
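The two ensembling strategies named in the abstract above, uniform and greedy soup, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: checkpoints are assumed to be dictionaries mapping parameter names to flat lists of floats, and `score` is a hypothetical held-out evaluation function supplied by the caller. The greedy variant follows the commonly described recipe (rank by held-out score, keep a checkpoint only if the averaged model does not get worse); the paper's exact procedure may differ.

```python
from typing import Callable, Dict, List

# Assumption: a checkpoint is a mapping from parameter name to a flat list of weights.
Checkpoint = Dict[str, List[float]]


def uniform_soup(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average every parameter element-wise across all checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    soup: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same weight position in every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(col) / n for col in columns]
    return soup


def greedy_soup(
    checkpoints: List[Checkpoint],
    score: Callable[[Checkpoint], float],  # hypothetical held-out metric, higher is better
) -> Checkpoint:
    """Add checkpoints best-first; keep one only if the soup's score does not drop."""
    ranked = sorted(checkpoints, key=score, reverse=True)
    kept = [ranked[0]]
    best = score(uniform_soup(kept))
    for ckpt in ranked[1:]:
        trial = score(uniform_soup(kept + [ckpt]))
        if trial >= best:
            kept.append(ckpt)
            best = trial
    return uniform_soup(kept)
```

Either function returns a single set of weights, which is why the technique adds no inference cost compared with running one model.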
6. 【2603.02175】Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Link: https://arxiv.org/abs/2603.02175
Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: witnessed rapid progress, Instruction-based video editing, precise visual control, complex visual nuances, describing complex visual
Comments:
Abstract:Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at this https URL.
7. 【2603.02172】GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Link: https://arxiv.org/abs/2603.02172
Authors: Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diffusion transformer designed, diffusion transformer, transformer designed, satellite image, image generation
Comments: 26 pages, 17 figures
Abstract:We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
8. 【2603.02162】Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction
Link: https://arxiv.org/abs/2603.02162
Authors: Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara P. Oliveira, Angelos Chatzimparmpas, Sanne Abeln, Wilson Silva
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: data sources influence, sources influence predictions, cancer survival prediction, survival prediction, increasingly more accurate
Comments:
Abstract:While multimodal survival prediction models are increasingly more accurate, their complexity often reduces interpretability, limiting insight into how different data sources influence predictions. To address this, we introduce DIMAFx, an explainable multimodal framework for cancer survival prediction that produces disentangled, interpretable modality-specific and modality-shared representations from histopathology whole-slide images and transcriptomics data. Across multiple cancer cohorts, DIMAFx achieves state-of-the-art performance and improved representation disentanglement. Leveraging its interpretable design and SHapley Additive exPlanations, DIMAFx systematically reveals key multimodal interactions and the biological information encoded in the disentangled representations. In breast cancer survival prediction, the most predictive features contain modality-shared information, including one capturing solid tumor morphology contextualized primarily by late estrogen response, where higher-grade morphology aligned with pathway upregulation and increased risk, consistent with known breast cancer biology. Key modality-specific features capture microenvironmental signals from interacting adipose and stromal morphologies. These results show that multimodal models can overcome the traditional trade-off between performance and explainability, supporting their application in precision medicine.
9. 【2603.02149】3D Field of Junctions: A Noise-Robust, Training-Free Structural Prior for Volumetric Inverse Problems
Link: https://arxiv.org/abs/2603.02149
Authors: Namhoon Kim, Narges Moeini, Justin Romberg, Sara Fridovich-Keil
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: face high levels, problems face high, inverse problems face, Field of Junctions, face high
Comments: Code will be released soon
Abstract:Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
10. 【2603.02142】Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
Link: https://arxiv.org/abs/2603.02142
Authors: Kwame Mbobda-Kuate, Gabriel Kasmi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: resource-constrained Earth observation, laws assume larger, consistently outperform smaller, Scaling laws assume, Earth observation
Comments: 13 pages, 9 figures, 8 tables
Abstract:Scaling laws assume larger models trained on more data consistently outperform smaller ones -- an assumption that drives model selection in computer vision but remains untested in resource-constrained Earth observation (EO). We conduct a systematic efficiency analysis across three scaling dimensions: model size, dataset size, and input resolution, on rooftop PV detection in Madagascar. Optimizing for model efficiency (mAP50 per unit of model size), we find a consistent efficiency inversion: YOLO11N achieves both the highest efficiency (24× higher than YOLO11X) and the highest absolute mAP50 (0.617). Resolution is the dominant resource allocation lever (+120% efficiency gain), while additional data yields negligible returns at low resolution. These findings are robust to the deployment objective: small high-resolution configurations are Pareto-dominant across all 44 setups in the joint accuracy-throughput space, leaving no tradeoff to resolve. In data-scarce EO, bigger is not just unnecessary: it can be worse.
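The two quantities this abstract relies on, accuracy per unit of model size and Pareto dominance in the accuracy-throughput plane, are simple to compute. The sketch below is illustrative only: the `Setup` record, its field names, and every number in the usage lines are made up for the example and are not taken from the paper, and model size is assumed here to be measured in millions of parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Setup:
    """One hypothetical training configuration (all fields are illustrative)."""
    name: str
    map50: float     # detection accuracy
    params_m: float  # model size in millions of parameters (an assumption)
    fps: float       # inference throughput


def efficiency(s: Setup) -> float:
    """Accuracy obtained per unit of model size."""
    return s.map50 / s.params_m


def dominates(a: Setup, b: Setup) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.map50 >= b.map50 and a.fps >= b.fps
    strictly_better = a.map50 > b.map50 or a.fps > b.fps
    return no_worse and strictly_better


def pareto_front(setups: List[Setup]) -> List[Setup]:
    """Keep the setups that no other setup dominates."""
    return [s for s in setups if not any(dominates(o, s) for o in setups)]


# Made-up numbers, chosen only to show an "efficiency inversion".
small = Setup("small-highres", map50=0.5, params_m=2.0, fps=100.0)
large = Setup("large-lowres", map50=0.4, params_m=50.0, fps=20.0)
front = pareto_front([small, large])
```

When one configuration dominates all the others, the front collapses to a single point and, as the abstract puts it, there is no tradeoff left to resolve.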
11. 【2603.02139】Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Link: https://arxiv.org/abs/2603.02139
Authors: Han Xue, Nan Min, Xiaotong Liu, Wendi Chen, Yuan Fang, Jun Lv, Cewu Lu, Chuan Wen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Field of View, exceptionally wide Field, wide Field, wrist-mounted fisheye cameras, rapidly outpacing
Comments: 22 pages, 15 figures, Accepted by CVPR 2026
Abstract:The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on this https URL
12. 【2603.02138】OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Link: https://arxiv.org/abs/2603.02138
Authors: Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng, Yujun Cai, Gang Yu, Xingjun Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Lottie JSON files, vector animations, versatile framework, quality vector animations, raw Lottie JSON
Comments: Accepted by CVPR 2026. Project Page: [this https URL](https://openvglab.github.io/OmniLottie/)
Abstract:OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, the raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multi-modal human instructions.
13. 【2603.02137】NextAds: Towards Next-generation Personalized Video Advertising
Link: https://arxiv.org/abs/2603.02137
Authors: Yiyan Xu, Ruoxuan Xia, Wuqiang Zheng, Fengbin Zhu, Wenjie Wang, Fuli Feng
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital advertising landscape, personalized video advertising, online video consumption, video advertising, rapid growth
Comments:
Abstract:With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the optimal one from a small set of professionally pre-produced creatives for each user. Such static and finite inventories limit both the granularity and the timeliness of personalization, and prevent the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time. In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
14. 【2603.02134】OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Link: https://arxiv.org/abs/2603.02134
Authors: Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, Gaussian Splatting, advances in generalizable, enabled rapid, per-scene optimization
Comments:
Abstract:Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm, lacking the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in online formulation is the cumulative drift issue, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showcasing robust performance across input sequences of varying lengths with real-time inference speed.
15. 【2603.02133】SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Link: https://arxiv.org/abs/2603.02133
Authors: Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: create object-centric representations, Compositional scene reconstruction, scene reconstruction seeks, Conventional compositional reconstruction, seeks to create
Comments:
Abstract:Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
16. 【2603.02130】Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
Link: https://arxiv.org/abs/2603.02130
Authors: Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu, Cewu Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: inertial measurement units, sparse inertial measurement, effectively mitigate occlusion, drift issues inherent, Recent advancements
Comments: The code, data, and supplementary materials are available at [this https URL](https://sites.google.com/view/stereo-inertial-poser). Accepted to ICRA 2026
Abstract:Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
17. 【2603.02129】LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation
Link: https://arxiv.org/abs/2603.02129
Authors: Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: drive high-fidelity avatar, head pose, paradigm that completes, completed signals, signals to drive
Comments: 19 pages, 11 figures
Abstract:We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
18. 【2603.02125】A 3D mesh convolution-based autoencoder for geometry compression
Link: https://arxiv.org/abs/2603.02125
Authors: Germain Bregeon, Marius Preda, Radu Ispas, Titus Zaharia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: mesh convolution-based autoencoder, irregular mesh data, watertightness conditions, preprocessing nor manifold, convolution-based autoencoder
Comments:
Abstract:In this paper, we introduce a novel 3D mesh convolution-based autoencoder for geometry compression, able to deal with irregular mesh data without requiring either preprocessing or manifold/watertightness conditions. The proposed approach extracts meaningful latent representations by learning features directly from the mesh faces, while preserving connectivity through dedicated pooling and unpooling operations. The encoder compresses the input mesh into a compact base mesh space, which ensures that the latent space remains comparable. The decoder reconstructs the original connectivity and restores the compressed geometry to its full resolution. Extensive experiments on multi-class datasets demonstrate that our method outperforms state-of-the-art approaches in both 3D mesh geometry reconstruction and latent space classification tasks. Code available at: this http URL
19. 【2603.02123】Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Link: https://arxiv.org/abs/2603.02123
Authors: Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: fragmented affective capabilities, high-level interaction, leading to fragmented, long been constrained, capabilities and limited
Comments: 17 pages, 8 figures, The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract:The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.
20. 【2603.02098】OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Link: https://arxiv.org/abs/2603.02098
Authors: Chuong Huynh, Manh Luong, Abhinav Shrivastava
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: retrieve desired targets, desired targets, retrieval, retrieve desired, Multimodal retrieval
Comments: CVPR 2026. Project link: [this https URL](https://github.com/hmchuong/omniret)
Abstract:Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations. OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and an MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks, composed audio retrieval and audio-visual retrieval, to more comprehensively evaluate a model's omni-modal embedding capacity.
21. 【2603.02096】FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
Link: https://arxiv.org/abs/2603.02096
Authors: Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Temporal Adjacency Selection, Spatial Domain Consolidation, streaming video understanding, paper presents FluxMem, efficient streaming video
Comments: Accepted at CVPR 2026. Project page: [this https URL](https://yiwengxie.com/FluxMem/)
Abstract:This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
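To make the temporal stage of FluxMem concrete, the sketch below drops tokens in the current frame that are nearly identical to the token at the same position in the previous frame. This is only an illustration under stated assumptions: the cosine-similarity criterion, the fixed `threshold`, and the function names are hypothetical, and the abstract states that the actual compression rate is chosen adaptively from scene statistics rather than by a fixed value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_tokens(prev_frame, cur_frame, threshold=0.95):
    """Return (index, token) pairs from cur_frame that differ enough from prev_frame.

    Assumption: redundancy is judged per spatial position with a fixed cosine
    threshold. FluxMem itself picks the compression rate adaptively.
    """
    kept = []
    for index, (prev_tok, cur_tok) in enumerate(zip(prev_frame, cur_frame)):
        if cosine(prev_tok, cur_tok) < threshold:
            kept.append((index, cur_tok))
    return kept
```

With two-token frames where only the second token changes, only that token is kept, so static regions cost no extra memory.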
22. 【2603.02087】Detection-Gated Glottal Segmentation with Zero-Shot Cross-Dataset Transfer and Clinical Feature Extraction
Link: https://arxiv.org/abs/2603.02087
Authors: Harikrishnan Unnikrishnan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Accurate glottal segmentation, high-speed videoendoscopy, Accurate glottal, segmentation in high-speed, essential for extracting
Comments: For associated code, see [this https URL](https://github.com/hari-krishnan/openglottal)
Abstract:Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a YOLOv8-based detector with a U-Net segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and instrument occlusion. The model was trained on a limited subset of the GIRAFE dataset (600 frames) and evaluated via zero-shot transfer on the large-scale BAGLS dataset. Results: The pipeline achieved state-of-the-art performance on the GIRAFE benchmark (DSC 0.81) and demonstrated superior generalizability on BAGLS (DSC 0.85, in-distribution) without institutional fine-tuning. Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features (Open Quotient, coefficient of variation) remained consistent with established clinical benchmarks. The coefficient of variation (CV) of the glottal area was found to be a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: The detection-gated architecture provides a lightweight, computationally efficient solution (~35 frames/s) for real-time clinical use. By enabling robust zero-shot transfer, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at this https URL.
Submission history: [v1] Mon, 2 Mar 2026 17:05:41 UTC (1,455 KB), from Harikrishnan Unnikrishnan
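The quantities reported in the glottal segmentation abstract above (DSC, detection gating, and the coefficient of variation of the glottal area) can be sketched in a few lines. This is a minimal illustration, not the released code: the function names are hypothetical, masks are assumed to be flat lists of 0/1 values, the gate is a plain boolean without the temporal consistency wrapper, and the population standard deviation is an assumption about how CV is computed.

```python
from statistics import fmean, pstdev


def dice(pred, target):
    """Dice similarity coefficient between two binary masks (flat lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def gate_mask(mask, detected):
    """Detection gating: keep the segmenter's mask only when the detector fired."""
    return mask if detected else [0] * len(mask)


def coefficient_of_variation(areas):
    """CV of per-frame glottal area: standard deviation divided by the mean."""
    mean = fmean(areas)
    return 0.0 if mean == 0 else pstdev(areas) / mean
```

Gating matters on closed-glottis frames: a segmenter that hallucinates a few pixels there scores a Dice of 0 against the empty ground truth, while the gated, empty mask scores 1.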
23. 【2603.02083】$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
Link: https://arxiv.org/abs/2603.02083
Authors: Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: online reinforcement learning, hindering online reinforcement, models excel, multi-step sampling, hindering online
Comments:
Abstract:Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $\pi$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $\pi$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
24. 【2603.02080】From Pixels to Patches: Pooling Strategies for Earth Embeddings
Link: https://arxiv.org/abs/2603.02080
Authors: Isaac Corley, Caleb Robinson, Inbal Becker-Reshef, Juan M. Lavista Ferres
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: downstream label resolution, preserve class-discriminative signal, matching downstream label, geospatial foundation models, practitioners must aggregate
Comments:
Abstract:As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
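The three pooling schemes named in the abstract above can be sketched as follows, treating a patch as a list of per-pixel embedding vectors. This is a rough illustration rather than the paper's implementation: the function names, the default exponent `p=3.0`, the clamping of values to `eps` (which assumes non-negative features), and the use of the population standard deviation are all assumptions.

```python
from statistics import fmean, pstdev


def mean_pool(pixels):
    """Average each embedding dimension over all pixel vectors in the patch."""
    return [fmean(column) for column in zip(*pixels)]


def gem_pool(pixels, p=3.0, eps=1e-6):
    """Generalized mean per dimension: (mean(x ** p)) ** (1 / p).

    Assumption: values are clamped to eps, so features are treated as
    non-negative. p = 1 recovers mean pooling; large p approaches max pooling.
    The output has the same dimensionality as one input vector.
    """
    return [
        fmean(max(x, eps) ** p for x in column) ** (1.0 / p)
        for column in zip(*pixels)
    ]


def stats_pool(pixels):
    """Concatenate per-dimension min, max, mean, and std (4x the input size)."""
    columns = list(zip(*pixels))
    return (
        [min(c) for c in columns]
        + [max(c) for c in columns]
        + [fmean(c) for c in columns]
        + [pstdev(c) for c in columns]
    )
```

For a patch of two pixels `[[1.0, 2.0], [3.0, 4.0]]`, `mean_pool` and `gem_pool` both return two values, while `stats_pool` returns eight, which matches the 4x size trade-off described in the abstract.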
25. 【2603.02079】MMNavAgent: Multi-Magnification WSI Navigation Agent for Clinically Consistent Whole-Slide Analysis
Link: https://arxiv.org/abs/2603.02079
Authors: Zhengyang Xu, Han Li, Jingsong Liu, Linrui Xie, Xun Ma, Xin You, Shihui Zu, Ayako Ito, Xinyu Hao, Hongming Xu, Shaohua Kevin Zhou, Nassir Navab, Peter J. Schüffler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improve Whole-Slide Image, diagnostically relevant regions, selecting diagnostically relevant, Whole-Slide Image, predefined magnification traversal
Comments:
Abstract:Recent AI navigation approaches aim to improve Whole-Slide Image (WSI) diagnosis by modeling spatial exploration and selecting diagnostically relevant regions, yet most operate at a single fixed magnification or rely on predefined magnification traversal. In clinical practice, pathologists examine slides across multiple magnifications and selectively inspect only necessary scales, dynamically integrating global and cellular evidence in a sequential manner. This mismatch prevents existing methods from modeling cross-magnification interactions and adaptive magnification selection inherent to real diagnostic workflows. To address these limitations, we propose a clinically consistent Multi-Magnification WSI Navigation Agent (MMNavAgent) that explicitly models multi-magnification interaction and adaptive magnification selection. Specifically, we introduce a Cross-Magnification navigation Tool (CMT) that aggregates contextual information from adjacent magnifications to enhance discriminative representations along the navigation path. We further introduce a Magnification Selection Tool (MST) that leverages memory-driven reasoning within the agent framework to enable interactive and adaptive magnification selection, mimicking the sequential decision process of pathologists. Extensive experiments on a public dataset demonstrate improved diagnostic performance, with a 1.45% gain in AUC and a 2.93% gain in BACC over a non-agent baseline. Code will be public upon acceptance.
26. 【2603.02063】ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks
Link: https://arxiv.org/abs/2603.02063
Authors: Joël Küchler, Ellen van Maren, Vaiva Vasiliauskaitė, Katarina Vulić, Reza Abbasi-Asl, Stephan J. Ihle
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: data generation, extracting information, Object-centric representation learning, data, Object-centric representation
Comments: GitHub: [this https URL](https://github.com/Hullimulli/ORGAN)
Abstract:Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.
27. 【2603.02049】WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
Link: https://arxiv.org/abs/2603.02049
Authors: Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: foundational Video Diffusion, yielded significant progress, Recent advances, Video Diffusion Models, Video Diffusion
Comments:
Abstract:Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.
28. 【2603.02047】NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis
Link: https://arxiv.org/abs/2603.02047
Authors: Manuel Serna-Aguilera, Raegan Anderes, Page Dobbs, Khoa Luu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: health crisis continues, nicotine addiction public, crisis continues, public health crisis, addiction public health
Comments:
Abstract:The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has released and marketed new products in an aggressive effort to lure new and young customers for life. Such innovations and product development, namely flavored nicotine or tobacco such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. Thus, we introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset to provide public health researchers with over 200,000 multimodal samples, including images and text descriptions, on 55 tobacco and nicotine product brands. In addition, to provide public health researchers with factual connections across a large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that can retrieve image features without incurring the high cost of language models, as well as the added cost of processing image tokens with large-scale datasets such as NICO. At construction time, NICO-RAG organizes image- and text-extracted entities and relations into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that without needing to process additional tokens from images for over 100 questions, NICO-RAG performs comparably to the state-of-the-art RAG method adapted for images.
29. 【2603.02035】LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers
Link: https://arxiv.org/abs/2603.02035
Authors: Fabian Schmidt, Karol Fedurko, Markus Enzweiler, Abhinav Valada
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide advanced reasoning, discrete semantic knowledge, multimodal large language, continuous trajectories remains, large language models
Comments:
Abstract:While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on this https URL.
30. 【2603.02026】Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
Link: https://arxiv.org/abs/2603.02026
Authors: Simon Ging (1 and 2), Philipp Arnold (3), Sebastian Walter (4), Hani Alnahas (1), Hannah Bast (4), Elmar Kotter (3), Jiancheng Yang (5 and 6), Behzad Bozorgtabar (2), Thomas Brox (1) ((1) Computer Vision Group, University of Freiburg, Germany, (2) Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark, (3) Department of Radiology, Medical Center - University of Freiburg, Germany, (4) Chair of Algorithms and Data Structures, University of Freiburg, Germany, (5) ELLIS Institute Finland, (6) School of Electrical Engineering, Aalto University, Finland)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: limited public data, coarse global supervision, models align volumes, vision-language models align, align volumes
Comments:
Abstract:Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization (predicting the axial depth referred to by a text snippet), reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
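For readers unfamiliar with the "SigLIP-style contrastive pretraining" mentioned above, the sketch below shows the pairwise sigmoid loss that SigLIP uses: every volume-report pair in a batch is scored independently as a match (+1) or non-match (-1). It is a simplified illustration, not the authors' training code: the function names are hypothetical, embeddings are assumed to be L2-normalized, and the temperature and bias are fixed at SigLIP's usual initial values (10 and -10) although they are learnable in practice.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of aligned (image, text) embeddings.

    Pair (i, j) is labeled +1 when i == j and -1 otherwise; the loss is the
    negative log-likelihood of those labels, averaged over the batch size.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(image_embs[i], text_embs[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n
```

Unlike a softmax-based contrastive loss, each pair contributes its own binary term, so no normalization across the whole batch is required.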
31. 【2603.02024】MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Link: https://arxiv.org/abs/2603.02024
Authors: Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent progress, multimodal large language, large language models, reasoning, large language
Notes: Accepted by ICLR 2026, 78 pages, 60 figures
Click to view abstract
Abstract:Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
32. 【2603.02012】MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising
Link: https://arxiv.org/abs/2603.02012
Authors: Peiyuan Jing, Chun-Wun Cheng, Liutao Yang, Zhenxuan Zhang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier A. Montoya-Zegarra
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Low-dose Positron Emission, Positron Emission Tomography, Low-dose Positron, Emission Tomography, Positron Emission
Notes: 8 pages, 3 figures
Click to view abstract
Abstract:Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.
33. 【2603.01999】Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation
Link: https://arxiv.org/abs/2603.01999
Authors: Jan Finke, Wayne Paul Martis, Adrian Schmelter, Lars Erbach, Christian Jestel, Marvin Wiedemann
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: industrial settings demands, Reliable obstacle avoidance, missing critical obstacles, LiDAR sensors perceive, single horizontal slice
Notes:
Click to view abstract
Abstract:Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.
34. 【2603.01997】Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
Link: https://arxiv.org/abs/2603.01997
Authors: Hari Prasanth S.M., Pejman Habibiroudkenar, Eerik Alamikkotervo, Dimitrios Bouzoulas, Risto Ojala
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: fast-moving aerial objects, prediction remains limited, observing fast-moving aerial, Event cameras provide, trajectory prediction remains
Notes: Submitted to ICUAS 2026 conference
Click to view abstract
Abstract:Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speeds are extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and a vanilla Kalman filter in terms of average distance error and final distance error at 0.4 s and 0.8 s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
35. 【2603.01993】Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
Link: https://arxiv.org/abs/2603.01993
Authors: Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: posing substantial challenges, Recent advances, multimodal media manipulation, advances in generative, significantly enhanced
Notes:
Click to view abstract
Abstract:Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
36. 【2603.01990】According to Me: Long-Term Personalized Referential Memory QA
Link: https://arxiv.org/abs/2603.01990
Authors: Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: naturally spans multiple, spans multiple modalities, long-term user memory, existing Long-term Memory, long-term user
Notes: Preprint
Click to view abstract
Abstract:Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code available at: this https URL
37. 【2603.01976】Robust White Blood Cell Classification with Stain-Normalized Decoupled Learning and Ensembling
Link: https://arxiv.org/abs/2603.01976
Authors: Luu Le, Hoang-Loc Cao, Ha-Hieu Pham, Thanh-Huy Nguyen, Ulas Bagci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: White blood cell, leukemia screening, infection assessment, treatment monitoring, Robust White Blood
Notes:
Click to view abstract
Abstract:White blood cell (WBC) classification is fundamental for hematology applications such as infection assessment, leukemia screening, and treatment monitoring. However, real-world WBC datasets present substantial appearance variations caused by staining and scanning conditions, as well as severe class imbalance in which common cell types dominate while rare but clinically important categories are underrepresented. To address these challenges, we propose a stain-normalized, decoupled training framework that first learns transferable representations using instance-balanced sampling, and then rebalances the classifier with class-aware sampling and a hybrid loss combining effective-number weighting and focal modulation. At inference, we further enhance robustness by ensembling various trained backbones with test-time augmentation. Our approach achieved the top rank on the leaderboard of the WBCBench 2026: Robust White Blood Cell Classification Challenge at ISBI 2026.
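The abstract above mentions a hybrid loss combining effective-number weighting and focal modulation but does not spell it out. As a rough illustration only, the sketch below uses the commonly cited class-balanced weight `(1 - beta) / (1 - beta**n_c)` and the standard focal factor `(1 - p)**gamma`. The function names, the `mix` parameter, and the default values of `beta` and `gamma` are assumptions made for this example and are not taken from the paper.

```python
import math


def effective_number_weights(class_counts, beta=0.999):
    """Class-balanced weights w_c = (1 - beta) / (1 - beta**n_c), rescaled to mean 1.

    Assumes every count is a positive integer; beta is an illustrative default.
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def focal_term(p_true, gamma=2.0):
    """Focal-modulated negative log-likelihood of the true-class probability."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def hybrid_loss(probs, label, weights, gamma=2.0, mix=0.5):
    """Hypothetical hybrid: class-weighted mix of cross-entropy and focal loss.

    `mix` is an assumed blending knob, not a parameter described in the abstract.
    """
    p_true = probs[label]
    cross_entropy = -math.log(p_true)
    blended = mix * cross_entropy + (1.0 - mix) * focal_term(p_true, gamma)
    return weights[label] * blended


# Usage: a common class with 5000 samples and a rare class with 50 samples.
weights = effective_number_weights([5000, 50])
loss_rare = hybrid_loss([0.7, 0.3], label=1, weights=weights)
loss_common = hybrid_loss([0.7, 0.3], label=0, weights=weights)
```

In this sketch the rare class receives the larger weight, so a mistake on a rare cell type costs more than an equally confident mistake on a common one.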
38. 【2603.01953】Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy
Link: https://arxiv.org/abs/2603.01953
Authors: Pengyuan Wu, Pingrui Zhang, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable results, Diffusion-based policies, Closed-Loop Diffusion Policy, Diffusion Policy framework, leading to delayed
Notes: Accepted by ICRA2026
Click to view abstract
Abstract:Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: this https URL
39. 【2603.01950】Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Link: https://arxiv.org/abs/2603.01950
Authors: Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: manga would introduce, medium of storytelling, visually impaired users, visually impaired, impaired users
Notes: 8 pages, 2 figures, 3 tables. Includes link to code
Click to view abstract
Abstract:A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
40. 【2603.01948】PreSight: Preoperative Outcome Prediction for Parkinson's Disease via Region-Prior Morphometry and Patient-Specific Weighting
Link: https://arxiv.org/abs/2603.01948
Authors: Yand Wang, Chen Zhang, Lanyun Zhu, Yixin Chen, Qunbo Wang, Yutong Bai, Jurgen Germann, Yinghong Wen, Shuai Shao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Parkinson disease surgery, Parkinson disease, Preoperative improvement rate, patients are heterogeneous, clinically important
Notes:
Click to view abstract
Abstract:Preoperative improvement rate prediction for Parkinson's disease surgery is clinically important yet difficult because imaging signals are subtle and patients are heterogeneous. We address this setting, where only information available before surgery is used, and the goal is to predict patient-specific postoperative motor benefit. We present PreSight, a presurgical outcome model that fuses clinical priors with preoperative MRI and deformation-based morphometry (DBM) and adapts regional importance through a patient-specific weighting module. The model produces end-to-end, calibrated, decision-ready predictions with patient-level explanations. We evaluate PreSight on a real-world two-center cohort of 400 subjects with multimodal presurgical inputs and postoperative improvement labels. PreSight outperforms strong clinical, imaging-only, and multimodal baselines. It attains 88.89% accuracy on internal validation and 85.29% on an external-center test for responder classification and shows better probability calibration and higher decision-curve net benefit. Ablations and analyses confirm the contribution of DBM and the patient-specific weighting module and indicate that the model emphasizes disease-relevant regions in a patient-specific manner. These results demonstrate that integrating clinical prior knowledge with region-adaptive morphometry enables reliable presurgical decision support in routine practice.
41. 【2603.01947】physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection
Link: https://arxiv.org/abs/2603.01947
Authors: Yuting Wan, Liguo Sun, Jiuwu Hao, Zao Zhang, Pin LV
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Unmanned Surface Vehicles, Detecting water-surface targets, targets for Unmanned, Surface Vehicles
Notes:
Click to view abstract
Abstract:Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
42. 【2603.01944】MobileMold: A Smartphone-Based Microscopy Dataset for Food Mold Detection
Link: https://arxiv.org/abs/2603.01944
Authors: Dinh Nam Pham, Leonard Prokisch, Bennet Meyer, Jonas Thumbs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: turn everyday devices, reveal fungal structures, enabling mold inspection, clip-on microscopes turn, microscopes turn everyday
Notes: Accepted to ACM Multimedia Systems (MMSys'26). Dataset and code available at [this https URL](https://mobilemold.github.io/dataset/)
Click to view abstract
Abstract:Smartphone clip-on microscopes turn everyday devices into low-cost, portable imaging systems that can even reveal fungal structures at the microscopic level, enabling mold inspection beyond unaided visual checks. In this paper, we introduce MobileMold, an open smartphone-based microscopy dataset for food mold detection and food classification. MobileMold contains 4,941 handheld microscopy images spanning 11 food types, 4 smartphones, 3 microscopes, and diverse real-world conditions. Beyond the dataset release, we establish baselines for (i) mold detection and (ii) food-type classification, including a multi-task setting that predicts both attributes. Across multiple pretrained deep learning architectures and augmentation strategies, we obtain near-ceiling performance (accuracy = 0.9954, F1 = 0.9954, MCC = 0.9907), validating the utility of our dataset for detecting food spoilage. To increase transparency, we complement our evaluation with saliency-based visual explanations highlighting mold regions associated with the model's predictions. MobileMold aims to contribute to research on accessible food-safety sensing, mobile imaging, and exploring the potential of smartphones enhanced with attachments.
43. 【2603.01932】BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
Link: https://arxiv.org/abs/2603.01932
Authors: Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Dustin Severtson, Ajmal Mian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cereal fields requires, fields requires pixel-level, Accurate weed mapping, requires pixel-level segmentation, cereal fields
Notes:
Click to view abstract
Abstract:Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop-weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
44. 【2603.01928】LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
Link: https://arxiv.org/abs/2603.01928
Authors: Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: revolutionized autonomous driving, perception and planning, leads to semantic-perceptual, perceptual-symbolic conflicts, revolutionized autonomous
Notes:
Click to view abstract
Abstract:While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. This is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and the model is then refined via reinforcement learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
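For readers unfamiliar with GRPO, which the abstract above names as the reinforcement learning stage, the core idea is to score each sampled output relative to the other outputs drawn for the same input. The minimal sketch below shows only that group-relative normalization step, as commonly described for GRPO. The function name, the `eps` constant, and the example rewards are assumptions for illustration, and nothing here reflects how LaST-VLA defines its rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group.

    `rewards` holds one scalar per sampled output for a single input.
    `eps` is an assumed small constant that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: four hypothetical trajectory samples for one driving scene.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Samples scoring above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed for the baseline.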
45. 【2603.01926】MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation
Link: https://arxiv.org/abs/2603.01926
Authors: Xinxin Dong, Haokai Ma, Yuze Zheng, Yongfu Zha, Yonghui Yang, Xiaodong Wang
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Micro-video recommendation aims, capture user preferences, aims to capture, context information, Micro-video recommendation
Notes:
Click to view abstract
Abstract:Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos, thereby predicting the appropriate videos. This target is often hindered by the inherent noise within multimodal content and unreliable implicit feedback, which weakens the correspondence between behaviors and underlying interests. While conventional works have predominantly approached this scenario through behavior-augmented modeling and content-centric multimodal analysis, these paradigms can inadvertently give rise to two non-trivial challenges: preference-irrelevant video representation extraction and inherent modality conflicts. To address these issues, we propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec), which simultaneously considers temporal correlations during preference modeling from intra- and inter-video perspectives. Specifically, we first propose Temporal-guided Content Diffusion (TCD) to refine video representations under intra-video temporal guidance and personalized collaborative signals to emphasize salient content while suppressing redundancy. To achieve semantically coherent preference modeling, we further design Noise-unconditional Preference Denoising (NPD) to recover informative user preferences from corrupted states under blind denoising. Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of our MealRec, further uncovering the effective mechanism of our proposed TCD and NPD. The source code and corresponding dataset will be available upon acceptance.
46. 【2603.01913】Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport
Link: https://arxiv.org/abs/2603.01913
Authors: Muyu Liu, Chenhe Du, Xuanyu Tian, Qing Wu, Xiao Wang, Haonan Zhang, Hongjiang Wei, Yuyao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: magnetic resonance imaging, field-dependent relaxation dynamics, contrast distortion due, magnetic resonance, democratizes access
Notes: 11 pages, 4 figures, conference paper
Click to view abstract
Abstract:Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.
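The abstract above relies on a Sinkhorn optimal transport module to align intensity distributions. As background only, the sketch below shows the generic Sinkhorn iteration for entropic-regularized transport between two small intensity histograms. It is not DACT's differentiable module: the function name, the three-bin histograms, the squared-difference cost, and the `eps` and `n_iters` values are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic-regularized transport plan between histograms a and b.

    Alternately rescales rows and columns of K = exp(-cost / eps) until the
    plan's marginals match a (rows) and b (columns).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Usage: hypothetical 3-bin intensity histograms for a source and a target domain.
levels = [0.0, 0.5, 1.0]
cost = [[(x - y) ** 2 for y in levels] for x in levels]
source_hist = [0.5, 0.3, 0.2]
target_hist = [0.2, 0.3, 0.5]
plan = sinkhorn(cost, source_hist, target_hist)
```

Each entry `plan[i][j]` is the mass moved from source bin `i` to target bin `j`. Smaller `eps` values give sharper plans but need more iterations to converge.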
47. 【2603.01893】Generative Visual Chain-of-Thought for Image Editing
Link: https://arxiv.org/abs/2603.01893
Authors: Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: nuanced spatial instructions, Existing image editing, editing methods struggle, Existing image, methods struggle
Notes: Project page: [this https URL](https://pris-cv.github.io/GVCoT/)
Click to view abstract
Abstract:Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
48. 【2603.01890】Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling
Link: https://arxiv.org/abs/2603.01890
Authors: Muyu Liu,Xuanyu Tian,Chenhe Du,Qing Wu,Hongjiang Wei,Yuyao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: dynamic range compression, blind inverse problem, challenging blind inverse, irreversible information loss, information loss introduced
Notes: 16 pages, 10 figures, conference paper
Click to view abstract
Abstract:Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the cascaded monotonic Bernstein (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing CaMB-Diff. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
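To make the "monotonicity as a hard architectural constraint" idea from the abstract above concrete, here is a minimal sketch of one well-known way to build a monotone curve from Bernstein polynomials: a Bernstein polynomial on [0, 1] is non-decreasing whenever its coefficients are non-decreasing, and a composition of non-decreasing maps is again non-decreasing. This is not the authors' CaMB operator; the softplus reparameterization, the normalization to [0, 1], and the helper names `monotone_coeffs`, `bernstein`, and `cascade` are assumptions made for illustration.

```python
import math

def softplus(z):
    """Numerically stable softplus, used to keep coefficient increments positive."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)

def monotone_coeffs(raw):
    """Map unconstrained parameters to non-decreasing coefficients from 0 to 1."""
    increments = [softplus(r) for r in raw]
    total = sum(increments)
    coeffs, acc = [0.0], 0.0
    for inc in increments:
        acc += inc
        coeffs.append(acc / total)
    return coeffs

def bernstein(coeffs, x):
    """Evaluate the Bernstein polynomial with the given coefficients at x in [0, 1]."""
    n = len(coeffs) - 1
    return sum(
        c * math.comb(n, k) * x ** k * (1.0 - x) ** (n - k)
        for k, c in enumerate(coeffs)
    )

def cascade(stages, x):
    """Compose several monotone stages; the composition stays monotone."""
    for coeffs in stages:
        x = bernstein(coeffs, x)
    return x

# Two stages with arbitrary (hypothetical) unconstrained parameters.
stages = [monotone_coeffs([0.3, -1.2, 2.0, 0.5]), monotone_coeffs([-0.7, 1.1, 0.0])]
curve = [cascade(stages, i / 50) for i in range(51)]
```

Because monotonicity holds for every value of the raw parameters, an optimizer can update them freely without ever leaving the set of monotone curves.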
49. 【2603.01878】CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection
Link: https://arxiv.org/abs/2603.01878
Authors: Yiheng Li,Zichang Tan,Guoqing Xu,Yijun Ye,Yang Yang,Zhen Lei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthetic Computed Tomography, Computed Tomography, demonstrated great potential, synthetic Computed, medical imaging
Notes: under review, repo: [this https URL](https://github.com/liyih/CTForensics)
Click to view abstract
Abstract:With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In this view, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.
50. 【2603.01864】Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
Link: https://arxiv.org/abs/2603.01864
Authors: Alexander Prutsch,David Schinagl,Horst Possegger
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: neighboring traffic agents, Future trajectories, trajectories of neighboring, neighboring traffic, traffic agents
Notes: WACV 2026 Oral. Project Page at [this https URL](https://a-pru.github.io/seam/)
Click to view abstract
Abstract:Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
51. 【2603.01850】Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones
Link: https://arxiv.org/abs/2603.01850
Authors: Ilenia Carboni,Elia Cereda,Lorenzo Lamberti,Daniele Malpetti,Francesco Conti,Daniele Palossi
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Keywords: nano-sized aerial robots, autonomously explore cluttered, nano-sized aerial, narrow environments, rescue missions
Notes: This paper has been accepted for publication in the IEEE ICRA 2026 conference. ©2026 IEEE
Click to view abstract
Abstract:Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, such as in industrial inspection and search and rescue missions. However, the price for their tiny size is a strict limit on their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering $\sim$100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on a ULP MCU with federated learning on nano-drones.
52. 【2603.01847】GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection
Link: https://arxiv.org/abs/2603.01847
Authors: Yutong Yang,Katarina Popović,Julian Wiederer,Markus Braun,Vasileios Belagiannis,Bin Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: variants show strong, show strong performance, strong performance, key task, uncertainty
Notes: Accepted to IEEE IV 2026. 8 pages, 5 figures
Click to view abstract
Abstract:Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of the detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need of multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring each group detects independently to achieve reliable ensemble-based uncertainty estimation. By leveraging the decoder's inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition. We validated our method under autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at this https URL.
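The attention mask described in the abstract above, which keeps query groups from interacting inside a shared decoder, can be sketched as a block-diagonal boolean matrix. This is only an illustration of the masking idea, not the released GroupEnsemble code; the function names and the convention that `True` means "blocked" are assumptions made here.

```python
def group_attention_mask(num_groups, queries_per_group):
    """Boolean self-attention mask; True marks a query pair that must NOT interact."""
    total = num_groups * queries_per_group
    return [
        [(row // queries_per_group) != (col // queries_per_group) for col in range(total)]
        for row in range(total)
    ]

def split_by_group(outputs, num_groups):
    """Split the flat list of per-query outputs back into one detection set per group."""
    per_group = len(outputs) // num_groups
    return [outputs[g * per_group:(g + 1) * per_group] for g in range(num_groups)]

# Three hypothetical groups of two object queries each.
mask = group_attention_mask(num_groups=3, queries_per_group=2)
detection_sets = split_by_group(["q0", "q1", "q2", "q3", "q4", "q5"], num_groups=3)
```

With such a mask, all groups can pass through the decoder in a single forward pass while each group only attends to its own queries, and the per-group outputs can then be compared to estimate how much the predictions disagree.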
53. 【2603.01840】FireRed-OCR Technical Report
Link: https://arxiv.org/abs/2603.01840
Authors: Hao Wu,Haoran Lou,Xinyue Li,Zuodong Zhong,Zhaojun Sun,Phellon Chen,Xuanhe Zhou,Kai Zuo,Yibo Chen,Xu Tang,Yao Hu,Boxiang Zhou,Jian Wu,Yongji Wu,Wenxin Yu,Yingmiao Liu,Yuhao Huang,Manjie Xu,Gang Liu,Yidong Ma,Zhichao Sun,Changhao Qiao
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: high-performance OCR models, specialize general VLMs, high-performance OCR, industrial OCR applications, Semantics Data Factory
Notes:
Click to view abstract
Abstract:We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
54. 【2603.01839】LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
Link: https://arxiv.org/abs/2603.01839
Authors: Kuangyi Chen,Jun Zhang,Yuxi Hu,Yi Zhou,Friedrich Fraundorfer
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: visually degraded environments, Event cameras offer, LiDAR point clouds, cameras offer, sensing that remains
Notes:
Click to view abstract
Abstract:Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
55. 【2603.01836】Affine Correspondences in Stereo Vision: Theory, Practice, and Limitations
Link: https://arxiv.org/abs/2603.01836
Authors: Levente Hajder
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Affine transformations, computer vision application, Affine, surface normals, reconstructed surface normals
Notes:
Click to view abstract
Abstract:Affine transformations have been recently used for stereo vision. They can be exploited in various computer vision applications, e.g., when estimating surface normals, homographies, and fundamental and essential matrices. Even full 3D reconstruction can be obtained by using affine correspondences. First, this paper overviews the fundamental statements for affine transformations and epipolar geometry. Then it is investigated how the transformation accuracy influences the quality of the 3D reconstruction. Besides, we propose novel techniques for estimating the local affine transformation from corresponding image directions; moreover, the fundamental matrix, related to the processed image pair, can also be exploited. Both synthetic and real quantitative evaluations are implemented based on the accuracy of the reconstructed surface normals. For the latter one, a special object, containing three perpendicular planes with chessboard patterns, is constructed. It is concluded that the estimation accuracy is around a few degrees for realistic test cases. Special stereo poses and plane orientations are also evaluated in detail.
56. 【2603.01812】Neural Operator-Grounded Continuous Tensor Function Representation and Its Applications
Link: https://arxiv.org/abs/2603.01812
Authors: Ruoyang Su,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yisi Luo,Michael K. Ng
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Keywords: attracted increasing attention, continuous tensor function, tensor function, tensor, continuous tensor
Notes:
Click to view abstract
Abstract:Recently, continuous tensor functions have attracted increasing attention, because they can represent data both on mesh grids and beyond mesh grids in a unified way. However, since the mode-$n$ product is essentially discrete and linear, the potential of current continuous tensor function representations is still locked. To break this bottleneck, we suggest neural operator-grounded mode-$n$ operators as a continuous and nonlinear alternative to the discrete and linear mode-$n$ product. Instead of mapping the discrete core tensor to the discrete target tensor, the proposed mode-$n$ operator directly maps the continuous core tensor function to the continuous target tensor function, which provides a genuine continuous representation of real-world data and can ameliorate discretization artifacts. Empowered by continuous and nonlinear mode-$n$ operators, we propose a neural operator-grounded continuous tensor function representation (abbreviated as NO-CTR), which can more faithfully represent complex real-world data compared with classic discrete tensor representations and continuous tensor function representations. Theoretically, we also prove that any continuous tensor function can be approximated by NO-CTR. To examine the capability of NO-CTR, we suggest an NO-CTR-based multi-dimensional data completion model. Extensive experiments across various data on regular mesh grids (multi-spectral images and color videos), on mesh grids with different resolutions (Sentinel-2 images) and beyond mesh grids (point clouds) demonstrate the superiority of NO-CTR.
57. 【2603.01804】Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
Link: https://arxiv.org/abs/2603.01804
Authors: Dragos Costea,Alina Marcu,Cristina Lazar,Marius Leordeanu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: AI-generated data compared, AI-generated data, data compared, human-generated data, study the ongoing
Notes:
Click to view abstract
Abstract:We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
58. 【2603.01776】FreeAct: Freeing Activations for LLM Quantization
Link: https://arxiv.org/abs/2603.01776
Authors: Xiaohao Liu,Xiaobo Xia,Manyi Zhang,Ji-Fu Li,Xianzhi Yu,Fei Shen,Xiu Su,See-Kiong Ng,Tat-Seng Chua
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Language Models, Large Language, overhead of Large, pivotal for mitigating
Notes: 26 pages, 18 figures, 2 tables
Click to view abstract
Abstract:Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To advance this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs, together with in-depth analyses, demonstrate that FreeAct significantly outperforms baselines, with up to a 5.3% performance improvement. Our code will be publicly released.
59. 【2603.01767】Downstream Task Inspired Underwater Image Enhancement: A Perception-Aware Study from Dataset Construction to Network Design
Link: https://arxiv.org/abs/2603.01767
Authors: Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: face challenges posed, real underwater environments, Underwater image enhancement, color inconsistencies, face challenges
Notes: Accepted for publication in IEEE TIP 2026
Click to view abstract
Abstract:In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high-frequency details that are critical for task-specific recognition. To address this issue, we propose a Downstream Task-Inspired Underwater Image Enhancement (DTI-UIE) framework, which leverages human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two-branch network with task-aware attention module for feature mixing. The network benefits from a multi-stage training framework and a task-driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task-Inspired UIE Dataset (TI-UIED) using various task-specific networks. Experimental results demonstrate that DTI-UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The codes are publicly available at this https URL.
60. 【2603.01765】Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Link: https://arxiv.org/abs/2603.01765
Authors: Minseok Seo,Wonjun Lee,Jaehyuk Jang,Changick Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: gained attention, ability to generalize, generalize across environments, environments without sensor-specific, Abstract
Notes: 17 pages, 7 figures [We achieved a new Pareto frontier in test-time depth completion.]
Click to view abstract
Abstract:Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward-backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
61. 【2603.01758】Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
Link: https://arxiv.org/abs/2603.01758
Authors: Yuxuan Li,Yuming Chen,Yunheng Li,Ming-Ming Cheng,Xiang Li,Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remote sensing object, accurately detect objects, multi-modal remote sensing, Heterogeneous multi-modal remote, sensing object detection
Notes:
Click to view abstract
Abstract:Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: this https URL.
62. 【2603.01757】StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
Link: https://arxiv.org/abs/2603.01757
Authors: Keli Liu,Zhendong Wang,Wengang Zhou,Houqiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cost grows quadratically, Visual AutoRegressive, enable efficient hierarchical, inference cost grows, efficient hierarchical generation
Notes:
Click to view abstract
Abstract:Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.
63. 【2603.01756】NeuroSymb-MRG: Differentiable Abductive Reasoning with Active Uncertainty Minimization for Radiology Report Generation
Link: https://arxiv.org/abs/2603.01756
Authors: Rong Fu,Yiqing Lyu,Chunlei Meng,Muge Qi,Yabin Jin,Qi Zhao,Li Bao,Juntao Gao,Fuqian Shi,Nilanjan Dey,Wei Luo,Simon Fong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduce clinician workload, Automatic generation, radiology reports seeks, improving documentation consistency, generation of radiology
Notes: 12 pages, 1 figure
Click to view abstract
Abstract:Automatic generation of radiology reports seeks to reduce clinician workload while improving documentation consistency. Existing methods that adopt encoder-decoder or retrieval-augmented pipelines achieve progress in fluency but remain vulnerable to visual-linguistic biases, factual inconsistency, and lack of explicit multi-hop clinical reasoning. We present NeuroSymb-MRG, a unified framework that integrates NeuroSymbolic abductive reasoning with active uncertainty minimization to produce structured, clinically grounded reports. The system maps image features to probabilistic clinical concepts, composes differentiable logic-based reasoning chains, decodes those chains into templated clauses, and refines the textual output via retrieval and constrained language-model editing. An active sampling loop driven by rule-level uncertainty and diversity guides clinician-in-the-loop adjudication and promptbook refinement. Experiments on standard benchmarks demonstrate consistent improvements in factual consistency and standard language metrics compared to representative baselines.
64. 【2603.01746】An Analysis of Multi-Task Architectures for the Hierarchic Multi-Label Problem of Vehicle Model and Make Classification
Link: https://arxiv.org/abs/2603.01746
Authors: Alexandru Manole, Laura Diosan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: semantically rich structure, Deep Learning approaches, Deep Learning, organized hierarchically, Deep Learning classifiers
Comments: 14 pages, 8 figures, 7 tables
Abstract:Most information in our world is organized hierarchically; however, many Deep Learning approaches do not leverage this semantically rich structure. Research suggests that human learning benefits from exploiting the hierarchical structure of information, and intelligent models could similarly take advantage of this through multi-task learning. In this work, we analyze the advantages and limitations of multi-task learning in a hierarchical multi-label classification problem: car make and model classification. Considering both parallel and cascaded multi-task architectures, we evaluate their impact on different Deep Learning classifiers (CNNs, Transformers) while varying key factors such as dropout rate and loss weighting to gain deeper insight into the effectiveness of this approach. The tests are conducted on two established benchmarks: StanfordCars and CompCars. We observe the effectiveness of the multi-task paradigm on both datasets, improving the performance of the investigated CNN in almost all scenarios. Furthermore, the approach yields significant improvements on the CompCars dataset for both types of models.
65. 【2603.01743】Action-Guided Attention for Video Action Anticipation
Link: https://arxiv.org/abs/2603.01743
Authors: Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Anticipating future actions, observed frames provide, Anticipating future, predict upcoming actions, requiring the inference
Comments: Accepted by ICLR 2026
Abstract:Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and to combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
66. 【2603.01725】Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration
Link: https://arxiv.org/abs/2603.01725
Authors: Guanglu Dong, Chunlei Li, Chao Ren, Jingliang Hu, Yilei Shi, Xiao Xiang Zhu, Lichao Mou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: significant breakthroughs, Recently, Task, image, Task Prompt
Comments: ICLR 2026
Abstract:Recently, significant breakthroughs have been made in all-in-one image restoration (AiOIR), which can handle multiple restoration tasks with a single model. However, existing methods typically focus on a specific image domain, such as natural scene, medical imaging, or remote sensing. In this work, we aim to extend AiOIR to multiple domains and propose the first multi-domain all-in-one image restoration method, DATPRL-IR, based on our proposed Domain-Aware Task Prompt Representation Learning. Specifically, we first construct a task prompt pool containing multiple task prompts, in which task-related knowledge is implicitly encoded. For each input image, the model adaptively selects the most relevant task prompts and composes them into an instance-level task representation via a prompt composition mechanism (PCM). Furthermore, to endow the model with domain awareness, we introduce another domain prompt pool and distill domain priors from multimodal large language models into the domain prompts. PCM is utilized to combine the adaptively selected domain prompts into a domain representation for each input image. Finally, the two representations are fused to form a domain-aware task prompt representation which can make full use of both specific and shared knowledge across tasks and domains to guide the subsequent restoration process. Extensive experiments demonstrate that our DATPRL-IR significantly outperforms existing SOTA image restoration methods, while exhibiting strong generalization capabilities. Code is available at this https URL.
67. 【2603.01720】Preoperative-to-intraoperative Liver Registration for Laparoscopic Surgery via Latent-Grounded Correspondence Constraints
Link: https://arxiv.org/abs/2603.01720
Authors: Ruize Cui, Jialun Pei, Haiqiao Wang, Jun Zhou, Jeremy Yuen-Chun Teoh, Pheng-Ann Heng, Jing Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: laparoscopic liver surgery, augmented reality technology, reality technology enhances, technology enhances intraoperative, enhances intraoperative anatomical
Comments: 10 pages, 4 figures
Abstract:In laparoscopic liver surgery, augmented reality technology enhances intraoperative anatomical guidance by overlaying 3D liver models from preoperative CT/MRI onto laparoscopic 2D views. However, existing registration methods lack explicit modeling of reliable 2D-3D geometric correspondences supported by latent evidence, leading to limited interpretability and potentially unstable alignment in clinical scenarios. In this work, we introduce Land-Reg, a correspondence-driven deformable registration framework that explicitly learns latent-grounded 2D-3D landmark correspondences as an interpretable intermediate representation to bridge cross-modal alignment. For rigid registration, Land-Reg embraces a Cross-modal Latent Alignment module to map multi-modal features into a unified latent space. Further, an Uncertainty-enhanced Overlap Landmark Detector with similarity matching is proposed to robustly estimate explicit 2D-3D landmark correspondences. For non-rigid registration, we design a novel shape-constrained supervision strategy that anchors shape deformation to matched landmarks through reprojection consistency and incorporates local-isometric regularization to alleviate inherent 2D-3D depth ambiguity, while a rendered-mask alignment enforces global shape consistency. Experimental results on the P2ILF dataset demonstrate the superiority of our method on both rigid pose estimation and non-rigid deformation. Our code will be available at this https URL.
68. 【2603.01713】Dual Distillation for Few-Shot Anomaly Detection
Link: https://arxiv.org/abs/2603.01713
Authors: Le Dong, Qinzhong Tan, Chunlei Li, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: impact patient outcomes, identifying pathologies early, directly impact patient, Anomaly detection, patient outcomes
Comments: ICLR 2026
Abstract:Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D$^2$4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D$^2$4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at this https URL.
69. 【2603.01708】WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Link: https://arxiv.org/abs/2603.01708
Authors: Gong Chen, Chaokun Zhang, Xinyan Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: vital for autonomous, autonomous driving, driving yet remains, remains constrained, constrained by tight
Comments: Accepted by CVPR26
Abstract:Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapt poorly to a changing environment. Later spatial selection methods improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency. Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4% with only 0.5% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across "what" and "where" to share is the key to achieving efficient collaborative perception.
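To make the receiver-centric request plan concrete, here is a minimal sketch under strong assumptions: each sender reports one saliency score per region, and the receiver greedily takes the globally highest-scoring regions up to a fixed budget. The function name `plan_requests` and the data layout are hypothetical; the abstract does not specify WhisperNet's actual planning rule.

```python
def plan_requests(saliency, budget):
    """Pick the `budget` highest-saliency (agent, region) pairs across all senders.

    saliency: dict mapping agent id -> list of per-region scores (sender metadata).
    Returns a dict mapping agent id -> sorted list of requested region indices.
    """
    candidates = [
        (score, agent, region)
        for agent, scores in saliency.items()
        for region, score in enumerate(scores)
    ]
    # Highest score first; ties broken by agent id, then region index.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    plan = {agent: [] for agent in saliency}
    for _, agent, region in candidates[:max(budget, 0)]:
        plan[agent].append(region)
    return {agent: sorted(regions) for agent, regions in plan.items()}
```

Because the ranking is global, an agent with uniformly low saliency may receive no requests at all, which is the point of budgeting across agents rather than per agent.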
70. 【2603.01706】Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking
Link: https://arxiv.org/abs/2603.01706
Authors: Tianqi Shen, Huakao Lin, Ning An
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: convolutional or Transformer, sophisticated fusion mechanisms, fusion mechanisms built, increasingly sophisticated fusion, Siamese visual trackers
Comments: 23 pages, 12 figures, 7 tables. This work was completed in 2024 and accepted for publication in IEEE TCDS (2026)
Abstract:Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
71. 【2603.01698】Towards Principled Dataset Distillation: A Spectral Distribution Perspective
Link: https://arxiv.org/abs/2603.01698
Authors: Ruixi Wu, Shaobo Wang, Jiahuan Chen, Zhiyuan Liu, Yicun Yang, Zhaorun Chen, Zekai Li, Kaixin Li, Xinming Wang, Hongzhu Yi, Kai Wang, Linfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: efficient model training, compact synthetic counterparts, compress large-scale datasets, aims to compress, model training
Comments: 30 pages, 5 tables, 4 figures
Abstract:Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods exhibit substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, resulting in the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform amplitude-phase decomposition, which adaptively prioritizes the realism in tail classes. On CIFAR-10-LT, with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, with only a 5.7% performance drop when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
72. 【2603.01696】Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Link: https://arxiv.org/abs/2603.01696
Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, misrepresent critical visual, critical visual content
Comments: Accepted by CVPR 2026
Abstract:Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on this http URL. Code will be released when the paper is accepted.
73. 【2603.01694】MVR: Multi-view Video Reward Shaping for Reinforcement Learning
Link: https://arxiv.org/abs/2603.01694
Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: reinforcement learning, great importance, importance for solving, solving complex tasks, Reward Shaping
Comments: ICLR 2026
Abstract:Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
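The concern that linearly adding VLM scores can alter the optimal policy is the usual motivation for potential-based reward shaping. The sketch below shows that classic form, r + γ·Φ(s') − Φ(s); it is not MVR's state-dependent formulation, which the abstract does not spell out. The `relevance` function is a hypothetical stand-in for a learned state relevance function.

```python
def shaped_reward(task_reward, phi_s, phi_next, gamma=0.99, done=False):
    """Classic potential-based shaping: r + gamma * phi(s') - phi(s).

    The potential of a terminal next state is treated as 0.
    """
    next_potential = 0.0 if done else phi_next
    return task_reward + gamma * next_potential - phi_s


def relevance(state):
    # Hypothetical potential: states closer to a target value of 1.0 score higher.
    return -abs(1.0 - state)


def shaping_total(states, gamma=1.0):
    """Sum of shaping terms along a trajectory, with the task reward set to 0."""
    return sum(
        shaped_reward(0.0, relevance(s), relevance(s_next), gamma=gamma)
        for s, s_next in zip(states, states[1:])
    )
```

With γ = 1 the shaping terms telescope, so `shaping_total` depends only on the first and last states. That property is why this form of shaping does not change which policy is optimal.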
74. 【2603.01688】CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
Link: https://arxiv.org/abs/2603.01688
Authors: Gong Chen, Chaokun Zhang, Pengcheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improve scene understanding, agents share information, scene understanding, agents share, share information
Comments: Accepted by CVPR26
Abstract:Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. Then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.
75. 【2603.01686】DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs
Link: https://arxiv.org/abs/2603.01686
Authors: Aryan Goyal, Ashish Mittal, Pranav Rao, Manoj Tadepalli, Preetham Putha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deep learning-based automated, initiate treatment earlier, learning-based automated diagnosis, enables healthcare professionals, Deep learning-based
Comments: Published at MICCAI 2025
Abstract:Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRR) using CT-Scan to generate synthetic frontal chest X-rays with artificially inserted lung nodules offers one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for Chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process: First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity, posing this as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
76. 【2603.01685】FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
Link: https://arxiv.org/abs/2603.01685
Authors: Shao Shitong, Gu Yufei, Xie Zeke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recent advent, advent of powerful, Hunyuan, Kling, powerful video generation
Comments: Accepted by CVPR 2026
Abstract:The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.
77. 【2603.01659】A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs
Link: https://arxiv.org/abs/2603.01659
Authors: Aryan Goyal, Shreshtha Singh, Ashish Mittal, Manoj Tadepalli, Piyush Kumar, Preetham Putha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improving patient outcomes, remains challenging due, detection remains challenging, Early detection, chest radiographs
Comments: Accepted at MIDL 2026 (Poster). Published on OpenReview on February 14, 2026. Proceedings version pending. OpenReview: https://openreview.net/forum?id=7DL7cu8Ui8
Abstract:Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule mask conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues, overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
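The abstract does not give the form of the orthogonality loss, so the following is only one plausible sketch: penalize the squared Frobenius norm of the product A·Bᵀ between two adapters' low-rank matrices, which is zero exactly when every row of one is orthogonal to every row of the other. The function name and the choice of which LoRA matrices to compare are assumptions.

```python
def orthogonality_penalty(a, b):
    """Squared Frobenius norm of a @ b.T for two (rank x dim) matrices.

    a, b: lists of rows, each row a list of floats of the same length.
    Returns 0.0 when the row spaces of a and b are mutually orthogonal.
    """
    total = 0.0
    for row_a in a:
        for row_b in b:
            dot = sum(x * y for x, y in zip(row_a, row_b))
            total += dot * dot
    return total
```

In training, a term like this would be added to the main objective with a small weight; a framework implementation would compute it with a single matrix multiply.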
78. 【2603.01650】PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Link: https://arxiv.org/abs/2603.01650
Authors: Xianqi Wang, Hao Yang, Hangtian Wang, Junda Cheng, Gangwei Xu, Min Lin, Xin Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern stereo matching, monocular depth foundation, monocular depth, depth foundation models, Modern stereo
Comments: Accepted to CVPR 2026
Abstract:Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.
79. 【2603.01647】QCAgent: An agentic framework for quality-controllable pathology report generation from whole slide image
Link: https://arxiv.org/abs/2603.01647
Authors: Rundong Wang, Wei Ba, Ying Zhou, Yingtai Li, Bowen Liu, Baizhi Wang, Yuhao Wang, Zhidong Yang, Kun Zhang, Rui Yan, S. Kevin Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: localized visual evidence, ground fine-grained statements, Recent methods, producing slide-level diagnostic, slide-level diagnostic descriptions
Abstract:Recent methods for pathology report generation from whole-slide images (WSIs) are capable of producing slide-level diagnostic descriptions but fail to ground fine-grained statements in localized visual evidence. Furthermore, they lack control over which diagnostic details to include and how to verify them. Inspired by emerging agentic analysis paradigms and the diagnostic workflow of pathologists, who selectively examine multiple fields of view, we propose QCAgent, an agentic framework for quality-controllable WSI report generation. The core innovations of this framework are as follows: (i) it incorporates a customized critique mechanism guided by a user-defined checklist specifying required diagnostic details and constraints; (ii) it re-identifies informative regions in the WSI based on the critique feedback and text-patch semantic retrieval, a process that iteratively enriches and reconciles the report. Experiments demonstrate that by making report requirements explicitly prompt-defined, constraint-aware, and verifiable through evidence-grounded refinement, QCAgent enables controllable generation of clinically meaningful and high-coverage pathology reports from WSIs.
80. 【2603.01640】MSP-ReID: Hairstyle-Robust Cloth-Changing Person Re-Identification
Link: https://arxiv.org/abs/2603.01640
Authors: Xiangyang He, Lin Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: varying clothing conditions, Cloth-Changing Person Re-Identification, aims to match, individual across cameras, cameras under varying
Comments: 8 pages, 3 figures. Accepted to the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
Abstract:Cloth-Changing Person Re-Identification (CC-ReID) aims to match the same individual across cameras under varying clothing conditions. Existing approaches often remove apparel and focus on the head region to reduce clothing bias. However, treating the head holistically without distinguishing between face and hair leads to over-reliance on volatile hairstyle cues, causing performance degradation under hairstyle changes. To address this issue, we propose the Mitigating Hairstyle Distraction and Structural Preservation (MSP) framework. Specifically, MSP introduces Hairstyle-Oriented Augmentation (HSOA), which generates intra-identity hairstyle diversity to reduce hairstyle dependence and enhance attention to stable facial and body cues. To prevent the loss of structural information, we design Cloth-Preserved Random Erasing (CPRE), which performs ratio-controlled erasing within clothing regions to suppress texture bias while retaining body shape and context. Furthermore, we employ Region-based Parsing Attention (RPA) to incorporate parsing-guided priors that highlight face and limb regions while suppressing hair features. Extensive experiments on multiple CC-ReID benchmarks demonstrate that MSP achieves state-of-the-art performance, providing a robust and practical solution for long-term person re-identification.
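Ratio-controlled erasing restricted to clothing regions can be sketched as follows. This is an illustration under assumptions, not the paper's CPRE: the function name, the rectangle sampling rule, and the use of a binary clothing mask from a human parser are all hypothetical.

```python
import random


def erase_within_mask(image, mask, ratio=0.3, fill=0, rng=None):
    """Erase a random rectangle, but only where `mask` marks clothing pixels.

    image, mask: non-empty 2D lists of equal shape; mask entries are 0 or 1.
    ratio: approximate fraction (0 < ratio <= 1) of the clothing bounding box to erase.
    Returns a new image; the input is left unchanged.
    """
    rng = rng or random.Random()
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    out = [list(row) for row in image]
    if not rows or not cols:
        return out  # no clothing pixels, nothing to erase
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    box_h, box_w = bottom - top + 1, right - left + 1
    # Scale each side by sqrt(ratio) so the rectangle covers about `ratio` of the box.
    scale = ratio ** 0.5
    h = max(1, min(box_h, round(box_h * scale)))
    w = max(1, min(box_w, round(box_w * scale)))
    r0 = rng.randint(top, bottom - h + 1)
    c0 = rng.randint(left, right - w + 1)
    for r in range(r0, r0 + h):
        for c in range(c0, c0 + w):
            if mask[r][c]:  # pixels outside the clothing region are never touched
                out[r][c] = fill
    return out
```

Because erasing is gated by the mask, head, limb, and background pixels survive, which matches the stated goal of suppressing clothing texture while keeping body shape and context.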
81. 【2603.01637】DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
Link: https://arxiv.org/abs/2603.01637
Authors: Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li, Yuhang Lu, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Zhihui Hao, Xianpeng Lang, Kaicheng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Comments:
Abstract:Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose DriveCombo, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we propose a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.
82. 【2603.01623】Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Link: https://arxiv.org/abs/2603.01623
Authors: Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: inference speed due, numerous iterative passes, Diffusion Transformers, dominant tool, tool for high-fidelity
Comments: CVPR 2026
Abstract:Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
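The core idea in this abstract (treat a feature as a function of time, fit Chebyshev coefficients by ridge regression, then evaluate the fit at later steps) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the scalar stand-in for a feature, the time normalisation to [-1, 1], and the ridge strength are all assumptions made for illustration.

```python
def chebyshev_basis(t, degree):
    """Return [T_0(t), ..., T_degree(t)] using T_{k+1} = 2 t T_k - T_{k-1}."""
    values = [1.0, t]
    for _ in range(2, degree + 1):
        values.append(2.0 * t * values[-1] - values[-2])
    return values[: degree + 1]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ridge_chebyshev(times, values, degree, ridge):
    """Fit coefficients c minimising ||A c - y||^2 + ridge * ||c||^2."""
    rows = [chebyshev_basis(t, degree) for t in times]
    k = degree + 1
    # Normal equations: (A^T A + ridge * I) c = A^T y
    gram = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(k)]
    return solve_linear(gram, rhs)


def forecast(coeffs, t):
    """Evaluate the fitted Chebyshev expansion at time t."""
    basis = chebyshev_basis(t, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


# Hypothetical usage: four "cached" steps, forecast two later steps.
cached_times = [-1.0, -0.6, -0.2, 0.2]
cached_values = [1.0 + 0.5 * t - 2.0 * t * t for t in cached_times]
coeffs = fit_ridge_chebyshev(cached_times, cached_values, degree=2, ridge=1e-9)
future = [forecast(coeffs, t) for t in (0.6, 1.0)]
```

In this toy setting the whole sampling horizon is mapped to [-1, 1], so the forecast steps stay inside the domain where the Chebyshev basis is well behaved. A real denoiser feature would be a tensor, with one such fit per entry or per channel.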
83. 【2603.01613】Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
Link: https://arxiv.org/abs/2603.01613
Authors: Yuchen Zou, Xiao Hu, Dexing Zhong, Yuqing Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Monocular re-localization plays, achieve human-like perception, Monocular re-localization, human-like perception, enabling intelligent agents
Comments: 7 pages, 4 figures
Abstract:Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
84. 【2603.01605】What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers
Link: https://arxiv.org/abs/2603.01605
Authors: Qin Su, Tie Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: achieve strong performance, decision-making remains difficult, Vision Transformers, achieve strong, visual recognition
Comments: PAKDD 2026: The 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Abstract:Vision Transformers (ViTs) achieve strong performance in visual recognition, yet their decision-making remains difficult to interpret. We propose BiCAM, a bidirectional class activation mapping method that captures both supportive (positive) and suppressive (negative) contributions to model predictions. Unlike prior CAM-based approaches that discard negative signals, BiCAM preserves signed attributions to produce more complete and contrastive explanations. BiCAM further introduces a Positive-to-Negative Ratio (PNR) that summarizes attribution balance and enables lightweight detection of adversarial examples without retraining. Across ImageNet, VOC, and COCO, BiCAM improves localization and faithfulness while remaining computationally efficient. It generalizes to multiple ViT variants, including DeiT and Swin. These results suggest the importance of modeling both supportive and suppressive evidence for interpreting transformer-based vision models.
85. 【2603.01603】Sparse View Distractor-Free Gaussian Splatting
Link: https://arxiv.org/abs/2603.01603
Authors: Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, fast novel view, view synthesis, enables efficient training, Gaussian
Comments:
Abstract:3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.
86. 【2603.01602】YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection
Link: https://arxiv.org/abs/2603.01602
Authors: PeiHuang Zheng, Yunlong Zhao, Zheng Cui, Yang Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Human vision exhibits, vision exhibits remarkable, exhibits remarkable adaptability, Human vision, vision exhibits
Comments: 9 pages, 6 figures
Abstract:Human vision exhibits remarkable adaptability in perceiving objects under camouflage. When color cues become unreliable, the visual system instinctively shifts its reliance from chrominance (color) to luminance (brightness and texture), enabling more robust perception in visually confusing environments. Drawing inspiration from this biological mechanism, we propose YCDa, an efficient early-stage feature processing strategy that embeds this "chrominance-luminance decoupling and dynamic attention" principle into modern real-time detectors. Specifically, YCDa separates color and luminance information in the input stage and dynamically allocates attention across channels to amplify discriminative cues while suppressing misleading color noise. The strategy is plug-and-play and can be integrated into existing detectors by simply replacing the first downsampling layer. Extensive experiments on multiple baselines demonstrate that YCDa consistently improves performance with negligible overhead. Notably, YCDa-YOLO12s achieves a 112% improvement in mAP over the baseline on COD10K-D and sets new state-of-the-art results for real-time camouflaged object detection across COD-D datasets.
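The chrominance-luminance decoupling step named in the title can be sketched with the standard full-range RGB to YCbCr conversion (ITU-R BT.601 coefficients, as used in JFIF). Which YCbCr variant the authors use is not stated in the abstract, and the attention module is not reproduced here; the function names and the nested-list image layout are assumptions for illustration only.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 / JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_luma_chroma(image):
    """Split an image (rows of (r, g, b) tuples) into a luminance plane
    and a chrominance plane of (cb, cr) pairs."""
    luma, chroma = [], []
    for row in image:
        converted = [rgb_to_ycbcr(*px) for px in row]
        luma.append([y for y, _, _ in converted])
        chroma.append([(cb, cr) for _, cb, cr in converted])
    return luma, chroma


# Hypothetical usage on a 1x2 image: a mid-gray pixel and a pure red pixel.
luma, chroma = split_luma_chroma([[(128, 128, 128), (255, 0, 0)]])
```

Once the two planes are separated, a detector front end could weight them independently, which is the kind of channel-wise attention the abstract describes.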
87. 【2603.01601】Dehallu3D: Hallucination-Mitigated 3D Generation from Single Image via Cyclic View Consistency Refinement
Link: https://arxiv.org/abs/2603.01601
Authors: Xiwen Wang, Shichao Zhang, Hailun Zhang, Ruowei Wang, Mao Li, Chenyu Zhou, Qijun Zhao, Ji-Zhe Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling broad applications, reconstruction models, reconstruction models suffer, enabling broad, reality and gaming
Comments:
Abstract:Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations mainly arise because existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers.
88. 【2603.01594】Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference
Link: https://arxiv.org/abs/2603.01594
Authors: Jiaqi Leng, Shuyuan Tu, Haidong Cao, Sicheng Xie, Daoguo Dong, Zuxuan Wu, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: presents a critical, critical yet underexplored, underexplored challenge, preference alignment presents, preference
Comments:
Abstract:Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs in naive classifier guidance in conditioned diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under the score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.
89. 【2603.01593】PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Link: https://arxiv.org/abs/2603.01593
Authors: Bo Ma, Jinsong Wu, Weiqi Yan, Catherine Shi, Minh Nguyen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dashcam videos collected, Dashcam videos, videos collected, collected by autonomous, autonomous or assisted-driving
Comments:
Abstract:Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location-sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is linked from the arXiv abstract page.
90. 【2603.01591】FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems
Link: https://arxiv.org/abs/2603.01591
Authors: Minwoo Kim, Seunghyeok Shin, Hongki Lim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: diffusion priors enable, priors enable inverse-problem, nonlinear forward operators, forward operators data, operators data consistency
Comments:
Abstract:Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel→latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5× speedup, without hand-coded adjoints or inner MCMC.
91. 【2603.01586】InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Link: https://arxiv.org/abs/2603.01586
Authors: Yecong Wan, Fan Li, Chunwei Wang, Hao Wu, Mingwen Shao, Wangmeng Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Emerging unified editing, demonstrated strong capabilities, Emerging unified, unified editing models, models have demonstrated
Comments:
Abstract:Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
92. 【2603.01579】SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis
Link: https://arxiv.org/abs/2603.01579
Authors: Chuqiao Wu, Jin Song, Yiyun Fei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: existing scenes remains, current generative models, realistic and structurally, existing scenes, scenes remains
Comments:
Abstract:Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
93. 【2603.01576】Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Link: https://arxiv.org/abs/2603.01576
Authors: Saurabh Kaushik, Lalit Maurya, Beth Tellman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse Earth observation, demonstrated strong potential, producing reliable maps, Earth observation task, diverse Earth
Comments:
Abstract:Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02 across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. A GitHub link is provided on the arXiv abstract page.
94. 【2603.01568】Rate-Distortion Signatures of Generalization and Information Trade-offs
Link: https://arxiv.org/abs/2603.01568
Authors: Leyla Roksan Caglar, Pedro A.M. Mediano, Baihan Lin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
Keywords: visual conditions remains, offer limited insight, metrics offer limited, visual conditions, conditions remains
Comments:
Abstract:Generalization to novel visual conditions remains a central challenge for both human and machine vision, yet standard robustness metrics offer limited insight into how systems trade accuracy for robustness. We introduce a rate-distortion-theoretic framework that treats stimulus-response behavior as an effective communication channel, derives rate-distortion (RD) frontiers from confusion matrices, and summarizes each system with two interpretable geometric signatures - slope (β) and curvature (κ) - which capture the marginal cost and abruptness of accuracy-robustness trade-offs. Applying this framework to human psychophysics and 18 deep vision models under controlled image perturbations, we compare generalization geometry across model architectures and training regimes. We find that both biological and artificial systems follow a common lossy-compression principle but occupy systematically different regions of RD space. In particular, humans exhibit smoother, more flexible trade-offs, whereas modern deep networks operate in steeper and more brittle regimes even at matched accuracy. Across training regimes, robustness training induces systematic but dissociable shifts in β/κ, revealing cases where improved robustness or accuracy does not translate into more human-like generalization geometry. These results demonstrate that RD geometry provides a compact, model-agnostic lens for comparing generalization behavior across systems beyond standard accuracy-based metrics.
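To make the "confusion matrix as a communication channel" view concrete, the sketch below computes one operating point from a confusion matrix: the rate as the mutual information between stimulus and response, and the distortion as the error rate. This is only an assumed reading of the setup. The abstract does not specify the distortion measure or how the frontier, slope, and curvature are derived, so the Hamming-style distortion and the function name here are illustrative choices, and no frontier is fitted.

```python
import math


def rate_and_distortion(confusion):
    """Return (mutual information in bits, error rate) for a square
    confusion matrix of counts (rows: true class, columns: response)."""
    total = sum(sum(row) for row in confusion)
    joint = [[count / total for count in row] for row in confusion]
    p_true = [sum(row) for row in joint]
    p_resp = [sum(col) for col in zip(*joint)]
    rate = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero cells contribute nothing to the sum
                rate += p * math.log2(p / (p_true[i] * p_resp[j]))
    # Assumed distortion: probability that the response differs from the truth.
    distortion = 1.0 - sum(joint[i][i] for i in range(len(joint)))
    return rate, distortion


# Hypothetical usage: a perfect two-class observer and a guessing observer.
perfect = rate_and_distortion([[50, 0], [0, 50]])
guessing = rate_and_distortion([[25, 25], [25, 25]])
```

Repeating this computation across perturbation levels would give a set of (rate, distortion) points, which is the kind of data from which a trade-off curve could then be summarised.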
95. 【2603.01558】TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
Link: https://arxiv.org/abs/2603.01558
Authors: Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Mask-based paradigms, dense rasterized intermediate, rasterized intermediate representation, road topology understanding, offer a complementary
Comments:
Abstract:Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
96. 【2603.01552】Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder
Link: https://arxiv.org/abs/2603.01552
Authors: Ayantika Das, Keerthi Ram, Mohanasankar Sivaprakasam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: longitudinal human brain, human brain images, brain images offer, track neurodegenerative progression, existing generative approaches
Comments:
Abstract:Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer's. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.
97. 【2603.01549】Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Link: https://arxiv.org/abs/2603.01549
Authors: Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: surrounding world responds, Humans learn, bodies move, Humans, VLA
Comments:
Abstract:Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
98. 【2603.01547】PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification
Link: https://arxiv.org/abs/2603.01547
Authors: Jian Yu, Joakim Nguyen, Jinrui Fang, Awais Naeem, Zeyuan Cao, Sanjay Krishnan, Nicholas Konz, Tianlong Chen, Chandra Krishnan, Hairong Wang, Edward Castillo, Ying Ding, Ankita Shukla
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: central nervous system, remains challenging due, limited training data, pediatric central nervous, nervous system tumors
Comments:
Abstract:Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
99. 【2603.01545】Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
Link: https://arxiv.org/abs/2603.01545
Authors: Zhengtong Zhu,Jiaqing Fan,Zhixuan Liu,Fanzhang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex textual inputs, Multimodal Large Language, Large Language Models, Reasoning Video Segmentation, fine-tune Multimodal Large
Comments: Accepted by AAAI 2026
Click to view abstract
Abstract:Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
100. 【2603.01544】RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry
Link: https://arxiv.org/abs/2603.01544
Authors: Xinchang Wang,Yunhao Chen,Yuechen Zhang,Congcong Bian,Zihao Guo,Xingjun Ma,Hui Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: downstream recognition systems, produce photo-realistic content, Recent image generators, generators produce photo-realistic, Recent image
Comments:
Click to view abstract
Abstract:Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available at Github.
101. 【2603.01535】Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing
Link: https://arxiv.org/abs/2603.01535
Authors: Zijin Yin,Bing Li,Kongming Liang,Hao Sun,Zhongjiang He,Zhanyu Ma,Jun Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical image analysis, semantic segmentation models, Semantic segmentation, segmentation models, pivotal roles
Comments: Submitted to IEEE TPAMI, under review
Click to view abstract
Abstract:Semantic segmentation takes pivotal roles in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performances. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
102. 【2603.01528】Boosting AI Reliability with an FSM-Driven Streaming Inference Pipeline: An Industrial Case
Link: https://arxiv.org/abs/2603.01528
Authors: Yutian Zhang,Zhongyi Pei,Yi Mao,Chen Wang,Lin Liu,Jianmin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: bias and vulnerabilities, widespread adoption, absent from training, Finite State Machine, training data
Comments: Preprint. The work was done in 2024
Click to view abstract
Abstract:The widespread adoption of AI in industry is often hampered by its limited robustness when faced with scenarios absent from training data, leading to prediction bias and vulnerabilities. To address this, we propose a novel streaming inference pipeline that enhances data-driven models by explicitly incorporating prior knowledge. This paper presents the work on an industrial AI application that automatically counts excavator workloads from surveillance videos. Our approach integrates an object detection model with a Finite State Machine (FSM), which encodes knowledge of operational scenarios to guide and correct the AI's predictions on streaming data. In experiments on a real-world dataset of over 7,000 images from 12 site videos, encompassing more than 300 excavator workloads, our method demonstrates superior performance and greater robustness compared to the original solution based on manual heuristic rules. We will release the code at this https URL.
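The idea of letting a Finite State Machine guide and correct per-frame predictions can be sketched as follows. This is only a minimal illustration, not the authors' pipeline: the three-phase work cycle (`dig`, `lift`, `dump`), the label names, and the function `count_workloads` are all hypothetical, since the abstract does not list the actual states. Labels that would require an illegal jump are treated as detector noise and ignored.

```python
from typing import Iterable

# Hypothetical work cycle; the paper's actual states are not given in the abstract.
NEXT_STATE = {"dig": "lift", "lift": "dump", "dump": "dig"}


def count_workloads(frame_labels: Iterable[str], start: str = "dig") -> int:
    """Count completed dig -> lift -> dump cycles from noisy per-frame labels."""
    state = start
    completed = 0
    for label in frame_labels:
        if label == state:
            continue  # still in the same phase
        if label == NEXT_STATE[state]:
            state = label  # legal transition
            if state == "dump":
                completed += 1  # one load has been dumped
        # Any other label is an illegal jump: keep the current state.
    return completed


# Two spurious detections: "dump" right after "dig", and "dig" right after "lift".
labels = ["dig", "dig", "dump", "lift", "lift", "dump", "dig", "lift", "dig", "dump"]
print(count_workloads(labels))  # prints 2
```

Counting every separate run of `dump` labels in this toy sequence would give 3, whereas the state machine rejects the out-of-order detection and reports 2.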
103. 【2603.01524】Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection
Link: https://arxiv.org/abs/2603.01524
Authors: Qirui Wu,Shizhou Zhang,De Cheng,Yinghui Xing,Lingyan Ran,Dahu Shi,Peng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Incremental Object Detection, Object Detection, aims to continuously, continuously learn, Incremental Object
Comments: Accepted in AAAI2026
Click to view abstract
Abstract:Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.
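To make the difference between forced and quality-gated assignment concrete, here is a toy sketch. It is not the Q-MCMF algorithm: a brute-force search over tiny inputs stands in for the min-cost max-flow solver, and the function name `quality_gated_match`, the threshold `min_iou`, and the example matrices are all assumptions made for illustration. Pairs whose IoU falls below the threshold are pruned, the number of remaining valid pairs is maximised, and total cost breaks ties.

```python
from itertools import permutations
from typing import List, Optional


def quality_gated_match(
    cost: List[List[float]], iou: List[List[float]], min_iou: float
) -> List[Optional[int]]:
    """Return, for each ground truth, its matched prediction index or None."""
    num_gt, num_pred = len(cost), len(cost[0])
    # Extra None slots let any ground truth stay unmatched.
    slots: List[Optional[int]] = list(range(num_pred)) + [None] * num_gt
    best_key, best = None, [None] * num_gt
    for assign in permutations(slots, num_gt):
        pairs = [(g, p) for g, p in enumerate(assign) if p is not None]
        if any(iou[g][p] < min_iou for g, p in pairs):
            continue  # contains a pruned (implausible) pair
        key = (-len(pairs), sum(cost[g][p] for g, p in pairs))
        if best_key is None or key < best_key:
            best_key, best = key, list(assign)
    return best


# Ground truth 1 overlaps no prediction well (all IoU values are below 0.5).
cost = [[0.2, 0.9, 0.8], [0.7, 0.95, 0.9]]
iou = [[0.8, 0.1, 0.05], [0.3, 0.05, 0.02]]

print(quality_gated_match(cost, iou, min_iou=0.0))  # forced: [0, 2]
print(quality_gated_match(cost, iou, min_iou=0.5))  # gated:  [0, None]
```

With no gating, the second ground truth is pushed onto a prediction with IoU 0.02, which is the kind of supervision the abstract calls background foregrounding; with gating it is simply left unmatched.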
104. 【2603.01515】FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
Link: https://arxiv.org/abs/2603.01515
Authors: Hanxiao Wang,Yuan-Chen Guo,Ying-Tian Liu,Zi-Xin Zou,Biao Zhang,Weize Quan,Ding Liang,Yan-Pei Cao,Dong-Ming Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: long vertex-coordinate sequences, long vertex-coordinate, mesh generation suffer, fundamental limitation, Autoregressive Autoencoder
Comments:
Click to view abstract
Abstract:Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
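The factor of nine and the 0.11 ratio quoted above follow from simple counting, sketched below under the assumption that the baseline flattening spends one token per coordinate of each of a triangle's three vertices. The helper names and the mesh size are hypothetical; this is arithmetic only, not the FACE tokenizer.

```python
COORDS_PER_VERTEX = 3  # x, y, z
VERTICES_PER_FACE = 3  # triangle


def vertex_coordinate_tokens(num_faces: int) -> int:
    """Tokens when every coordinate of every face vertex is its own token."""
    return num_faces * VERTICES_PER_FACE * COORDS_PER_VERTEX


def face_tokens(num_faces: int) -> int:
    """Tokens under a one-face-one-token scheme."""
    return num_faces


num_faces = 4000  # hypothetical mesh size
ratio = face_tokens(num_faces) / vertex_coordinate_tokens(num_faces)
print(f"{vertex_coordinate_tokens(num_faces)} -> {face_tokens(num_faces)} tokens, ratio {ratio:.2f}")
# prints: 36000 -> 4000 tokens, ratio 0.11
```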
105. 【2603.01509】Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling
Link: https://arxiv.org/abs/2603.01509
Authors: Zillur Rahman,Alex Sheng,Cristian Meo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: driven significant progress, remain highly sensitive, models remain highly, large-scale datasets, datasets have driven
Comments: 2026 ICLR TTU Workshop
Click to view abstract
Abstract:While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex, post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG based prompt optimization framework. 3R utilizes the power of current state-of-the-art T2V diffusion model and vision language model. It can be used with any T2V model without any kind of model training. The framework leverages three key strategies: RAG-based modifiers extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual contents. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.
106. 【2603.01506】OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Link: https://arxiv.org/abs/2603.01506
Authors: Jianqiang Ren,Lin Liu,Steven Hoi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian representation, One-shot method, leverages a Multi-LOD, representation for animatable, single image
Comments:
Click to view abstract
Abstract:We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.
107. 【2603.01498】Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
Link: https://arxiv.org/abs/2603.01498
Authors: Kai Zheng,Hang-Cheng Dong,Zhenkai Wu,Fupeng Wei,Wei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remote sensing imagery, complex scene variations, Tripath DINO architecture, fine grained monitoring, sensing imagery
Comments:
Click to view abstract
Abstract:In remote sensing imagery, multi-class change detection (MCD) is crucial for fine-grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tripath DINO architecture, which adopts a three-path complementary feature learning strategy to facilitate the rapid adaptation of pre-trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre-trained model as the backbone feature extraction network to learn coarse-grained features. An auxiliary path also adopts a siamese structure, progressively aggregating intermediate features from the siamese encoder to enhance the learning of fine-grained features. Finally, a multi-scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza change) and the classic SECOND dataset. GradCAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse-grained semantic changes and fine-grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.
108. 【2603.01493】PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
Link: https://arxiv.org/abs/2603.01493
Authors: Tianyi Xu,Rong Shan,Junjie Wu,Jiadeng Huang,Teng Wang,Jiachen Zhu,Wenteng Chen,Minxin Tu,Quantao Dou,Zhaoxiang Wang,Changwang Zhang,Weinan Zhang,Jun Wang,Jianghao Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: ecological archives defined, photo retrieval non-trivial, Personal photo albums, ecological archives, personalized photo retrieval
Comments: Under review
Click to view abstract
Abstract:Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes the personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots, failing to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic, personal albums. It is designed to shift the paradigm from visual matching to personalized multi-source intent-driven reasoning. Based on a rigorous multi-source profiling framework, which integrates visual semantics, spatial-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems perform poor tool orchestration. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings, necessitating robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. Our PhotoBench is available.
109. 【2603.01491】Radiometrically Consistent Gaussian Surfels for Inverse Rendering
Link: https://arxiv.org/abs/2603.01491
Authors: Kyu Beom Han,Jaeyoon Kim,Woo Jae Kim,Jinhwan Seo,Sung-eui Yoon
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: accurately disentangling material, disentangling material properties, Splatting has advanced, Gaussian Splatting, Gaussian
Comments: 9 pages, 6 figures, ICLR 2026 Oral paper
Click to view abstract
Abstract:Inverse rendering with Gaussian Splatting has advanced rapidly, but accurately disentangling material properties from complex global illumination effects, particularly indirect illumination, remains a major challenge. Existing methods often query indirect radiance from Gaussian primitives pre-trained for novel-view synthesis. However, these pre-trained Gaussian primitives are supervised only towards limited training viewpoints, thus lack supervision for modeling indirect radiances from unobserved views. To address this issue, we introduce radiometric consistency, a novel physically-based constraint that provides supervision towards unobserved views by minimizing the residual between each Gaussian primitive's learned radiance and its physically-based rendered counterpart. Minimizing the residual for unobserved views establishes a self-correcting feedback loop that provides supervision from both physically-based rendering and novel-view synthesis, enabling accurate modeling of inter-reflection. We then propose Radiometrically Consistent Gaussian Surfels (RadioGS), an inverse rendering framework built upon our principle by efficiently integrating radiometric consistency by utilizing Gaussian surfels and 2D Gaussian ray tracing. We further propose a finetuning-based relighting strategy that adapts Gaussian surfel radiances to new illuminations within minutes, achieving low rendering cost (10ms). Extensive experiments on existing inverse rendering benchmarks show that RadioGS outperforms existing Gaussian-based methods in inverse rendering, while retaining the computational efficiency.
110. 【2603.01490】ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models
Link: https://arxiv.org/abs/2603.01490
Authors: Cheng Yang,Jianhao Jiao,Lingyi Huang,Jinqi Xiao,Zhexiang Tang,Yu Gong,Yibiao Ying,Yang Sui,Jintian Lin,Wen Huang,Bo Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: language instructions, current observations, robot states, rely on current, predict actions
Comments: Accepted by ICRA 2026
Click to view abstract
Abstract:Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
111. 【2603.01485】SCATR: Mitigating New Instance Suppression in LiDAR-based Tracking-by-Attention via Second Chance Assignment and Track Query Dropout
Link: https://arxiv.org/abs/2603.01485
Authors: Brian Cheong,Letian Wang,Sandro Papais,Steven L. Waslander
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Track Query Dropout, Track Query, Chance Assignment, Query Dropout, false negative errors
Comments:
Click to view abstract
Abstract:LiDAR-based tracking-by-attention (TBA) frameworks inherently suffer from high false negative errors, leading to a significant performance gap compared to traditional LiDAR-based tracking-by-detection (TBD) methods. This paper introduces SCATR, a novel LiDAR-based TBA model designed to address this fundamental challenge systematically. SCATR leverages recent progress in vision-based tracking and incorporates targeted training strategies specifically adapted for LiDAR. Our work's core innovations are two architecture-agnostic training strategies for TBA methods: Second Chance Assignment and Track Query Dropout. Second Chance Assignment is a novel ground truth assignment that concatenates unassigned track queries to the proposal queries before bipartite matching, giving these track queries a second chance to be assigned to a ground truth object and effectively mitigating the conflict between detection and tracking tasks inherent in tracking-by-attention. Track Query Dropout is a training method that diversifies supervised object query configurations to efficiently train the decoder to handle different track query sets, enhancing robustness to missing or newborn tracks. Experiments on the nuScenes tracking benchmark demonstrate that SCATR achieves state-of-the-art performance among LiDAR-based TBA methods, outperforming previous works by 7.6% AMOTA and successfully bridging the long-standing performance gap between LiDAR-based TBA and TBD methods. Ablation studies further validate the effectiveness and generalization of Second Chance Assignment and Track Query Dropout. Code can be found at the following link: this https URL
112. 【2603.01475】WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
Link: https://arxiv.org/abs/2603.01475
Authors: Joshua Knights,Joseph Reid,Kaushik Roy,David Hall,Mark Cox,Peyman Moghadam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: alongside growing interest, Recent years, unstructured natural environments, scene understanding, alongside growing
Comments: IEEE International Conference on Robotics and Automation (ICRA) 2026
Click to view abstract
Abstract:Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
113. 【2603.01461】UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation
Link: https://arxiv.org/abs/2603.01461
Authors: Teng Wang,Haojun Jiang,Chenxi Li,Diwen Wang,Yihang Tang,Zhenguo Sun,Yujiao Deng,Shiji Song,Gao Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diagnosing cardiovascular diseases, timely patient care, high operational difficulties, hinders timely patient, Echocardiography is critical
Comments:
Click to view abstract
Abstract:Echocardiography is critical for diagnosing cardiovascular diseases, yet the shortage of skilled sonographers hinders timely patient care, due to high operational difficulties. Consequently, research on automated probe navigation has significant clinical potential. To achieve robust navigation, it is essential to leverage historical scanning information, mimicking how experts rely on past feedback to adjust subsequent maneuvers. Practical scanning data collected from sonographers typically consists of noisy trajectories inherently generated through trial-and-error exploration. However, existing methods typically model this history as a sequential chain, forcing models to overfit these noisy paths, leading to performance degradation on long sequences. In this paper, we propose UltraStar, which reformulates probe navigation from path regression to anchor-based global localization. By establishing a Star Graph, UltraStar treats historical keyframes as spatial anchors connected directly to the current view, explicitly modeling geometric constraints for precise positioning. We further enhance the Star Graph with a semantic-aware sampling strategy that actively selects the representative landmarks from massive history logs, reducing redundancy for accurate anchoring. Extensive experiments on a dataset with over 1.31 million samples demonstrate that UltraStar outperforms baselines and scales better with longer input lengths, revealing a more effective topology for history modeling under noisy exploration.
114. 【2603.01455】From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Link: https://arxiv.org/abs/2603.01455
Authors: Niu Lian,Yuting Wang,Hanshu Yao,Jinpeng Wang,Bin Chen,Yaowei Wang,Min Zhang,Shu-Tao Xia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Keywords: impressive short-term reasoning, human cognitive efficiency, large language models, demonstrated impressive short-term, long-horizon video understanding
Comments: TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding grounded in Fuzzy-Trace Theory. It features adaptive memory compression via the Information Bottleneck and employs an entropy-driven top-down retrieval to access fine-grained details only when necessary. 16 pages, 7 figures, 7 tables
Click to view abstract
Abstract:While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy, which first tries with the abstract Symbolic Schema and progressively "drills down" to the Sensory Buffer and Episodic Stream under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code is available at this https URL.
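The entropy-driven top-down retrieval described above can be sketched roughly as follows. This is a hedged illustration, not the MM-Mem implementation: the function `top_down_retrieve`, the fixed answer distributions, the threshold values, and the exact drill-down order of the two lower levels are assumptions. Each memory level returns a probability distribution over candidate answers, and the search moves to a finer level only while the entropy of that distribution stays above a threshold.

```python
import math
from typing import Callable, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_down_retrieve(
    levels: List[Tuple[str, Callable[[str], List[float]]]],
    query: str,
    threshold: float,
) -> Tuple[str, List[float]]:
    """Query levels from coarse to fine; stop once uncertainty is low enough."""
    name, probs = "", []
    for name, answer_fn in levels:
        probs = answer_fn(query)
        if entropy(probs) <= threshold:
            break  # confident enough, no need to drill down further
    return name, probs


# Toy stand-ins for the three memory levels (coarsest first).
levels = [
    ("symbolic_schema", lambda q: [0.4, 0.3, 0.3]),
    ("episodic_stream", lambda q: [0.7, 0.2, 0.1]),
    ("sensory_buffer", lambda q: [0.95, 0.03, 0.02]),
]

print(top_down_retrieve(levels, "who opened the door?", threshold=0.9)[0])
# prints: episodic_stream
```

A stricter threshold such as 0.5 makes the same query drill all the way down to the finest level, which mirrors the idea of paying for fine-grained detail only under high uncertainty.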
115. 【2603.01454】VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
Link: https://arxiv.org/abs/2603.01454
Authors: Duoxun Tang, Dasen Dai, Jiyao Wang, Xiao Yang, Jianyu Wang, Siqi Cai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: exhaust computational resources, Energy-Latency Attacks, computational resources, increasingly deployed, deployed in safety-critical
Comments:
Abstract:Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through masked teacher forcing to steer models toward expensive target sequences, combined with a refusal penalty and early-termination suppression to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205× and inflates the inference latency by more than 15× relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
116. 【2603.01450】Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection
Link: https://arxiv.org/abs/2603.01450
Authors: Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generation techniques poses, techniques poses significant, highly realistic synthetic, synthetic facial media, deepfake generation techniques
Comments: Accepted at ICDF2C 2025
Abstract:The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. Existing detection methods generalize poorly to emerging forgery patterns. To address this, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations on frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, which achieves state-of-the-art performance on the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at this https URL.
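For readers unfamiliar with the EER figures quoted in this abstract, the sketch below shows one common way to estimate an equal error rate from detector scores. It is a generic illustration, not the paper's evaluation code; the threshold sweep over observed scores, the function names, and the toy data are assumptions.

```python
from typing import Sequence, Tuple


def error_rates(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (FPR, FNR) when samples with score >= threshold are called fake.

    labels use 1 for fake and 0 for real.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return fp / negatives, fn / positives


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate the EER as the point where FPR and FNR are closest."""
    if 0 not in labels or 1 not in labels:
        raise ValueError("both real (0) and fake (1) samples are required")
    candidates = (error_rates(scores, labels, t) for t in sorted(set(scores)))
    fpr, fnr = min(candidates, key=lambda rates: abs(rates[0] - rates[1]))
    return (fpr + fnr) / 2


# Toy scores (hypothetical): one fake scores low and one real scores high.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER is better: it is the error level at which false alarms and missed fakes are equally frequent.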
117. 【2603.01441】Unifying Language-Action Understanding and Generation for Autonomous Driving
Link: https://arxiv.org/abs/2603.01441
Authors: Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: leverage world knowledge, complex driving scenes, promising paradigm, potential to leverage, leverage world
Comments:
Abstract:Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
118. 【2603.01433】DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis
Link: https://arxiv.org/abs/2603.01433
Authors: Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: spanning text tampering, identity document manipulation, datasets spanning text, unified zero-shot benchmark, present DOCFORGE-BENCH
Comments:
Abstract:We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
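The AUC-F1 gap described in this abstract can be reproduced on toy data. The Python sketch below is only an illustration under stated assumptions: the grid search, the F1-maximizing criterion, the flattened pixel lists, and the synthetic scores are hypothetical and are not taken from the benchmark's protocol.

```python
from typing import Sequence


def pixel_f1(scores: Sequence[float], labels: Sequence[int], tau: float) -> float:
    """Pixel-level F1 when pixels with score >= tau are marked as tampered."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= tau
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: Sequence[float]
) -> float:
    """Pick the candidate threshold with the highest F1 on calibration pixels."""
    return max(candidates, key=lambda tau: pixel_f1(scores, labels, tau))


# Synthetic calibration pixels (hypothetical): 4% are tampered, and their
# scores rank above every clean pixel yet stay below the default tau = 0.5.
toy_labels = [0] * 96 + [1] * 4
toy_scores = [0.02] * 96 + [0.30] * 4
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

f1_default = pixel_f1(toy_scores, toy_labels, 0.5)
tau_star = calibrate_threshold(toy_scores, toy_labels, grid)
f1_calibrated = pixel_f1(toy_scores, toy_labels, tau_star)
```

In this toy case the ranking of pixels is perfect, so the detector discriminates well, but the fixed threshold predicts no tampered pixels at all. Moving the threshold recovers the F1 without touching the model, which is the kind of shift the abstract attributes to calibration rather than representation.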
119. 【2603.01431】SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
Link: https://arxiv.org/abs/2603.01431
Authors: Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, Shiming Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: track individual sounding, aiming to identify, segment and track, track individual, Cross Attention Fusion
Comments: Accepted by Machine Intelligence Research
Abstract:Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, which cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during the instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.
120. 【2603.01418】UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
Link: https://arxiv.org/abs/2603.01418
Authors: Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Keywords: closed-source nature makes, training paradigms inaccessible, demonstrate remarkable capabilities, remarkable capabilities, paradigms inaccessible
Comments: Accepted at CVPR 2026 (Findings Track)
Abstract:While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
121. 【2603.01412】UETrack: A Unified and Efficient Framework for Single Object Tracking
Link: https://arxiv.org/abs/2603.01412
Authors: Ben Kang, Jie Zhao, Xin Chen, Wanting Geng, Bin Zhang, Lu Zhang, Dong Wang, Huchuan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: growing real-world demands, received increasing attention, real-world demands, increasing attention, growing real-world
Comments:
Abstract:With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at this https URL.
122. 【2603.01400】Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Link: https://arxiv.org/abs/2603.01400
Authors: Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Video Large Language, Language Models, Large Language, demonstrate strong video
Comments: Accepted by CVPR 2026
Abstract:Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors at the intra-frame and inter-frame levels to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame of each clip serves as the keyframe anchor, which ensembles similar information from consecutive frames through optimal transport while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage (AOT): this https URL.
123. 【2603.01398】Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Link: https://arxiv.org/abs/2603.01398
Authors: Junwei Zeng, Dong Liang, Sheng-Jun Huang, Kun Zhan, Songcan Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-level vision tasks, significantly degrades long-range, degrades long-range imaging, introducing geometric warping, turbulence significantly degrades
Comments: Accepted to CVPR 2026!
Abstract:Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. The dataset is publicly available at: this http URL.
124. 【2603.01371】TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity
Link: https://arxiv.org/abs/2603.01371
Authors: Xiao Cai, Lianli Gao, Pengpeng Zeng, Ji Zhang, Heng Tao Shen, Jingkuan Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: downstream real-world applications, Precise spatial fidelity, Precise spatial, real-world applications, critical for downstream
Comments:
Abstract:Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.
125. 【2603.01361】MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Link: https://arxiv.org/abs/2603.01361
Authors: Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Feature encoders play, Feature encoders, thin structures, play a key, key role
Comments: Accepted by CVPR 2026
Abstract:Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at this https URL.
126. 【2603.01332】Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
Link: https://arxiv.org/abs/2603.01332
Authors: Andrew Wang, Mike Davies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling real-time imaging, full-resolution spectral images, reconstruct full-resolution spectral, enabling real-time, autonomous driving
Comments:
Abstract:Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.
127. 【2603.01328】You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image
Link: https://arxiv.org/abs/2603.01328
Authors: Taoyue Wang, Xiang Zhang, Xiaotian Li, Huiyuan Yang, Lijun Yin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: single Blind Face, Blind Face image, generating consistent Novel-View, Blind Face, single Blind
Comments:
Abstract:We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.
128. 【2603.01324】Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
Link: https://arxiv.org/abs/2603.01324
Authors: Anna Michailidou, Georgios Angelidis, Vasileios Argyriou, Panagiotis Sarigiannidis, Georgios Th. Papadopoulos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: post-disaster damage assessment, Aerial imagery, damage assessment, imagery is critical, large-scale post-disaster damage
Comments: 7 pages, 2 figures
Abstract:Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative: rather than depending on fixed label sets and extensive task-specific annotations, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable finding across all evaluated benchmarks is that supervised training remains the most reliable approach when the label space is fixed and annotations are available, especially for small objects and fine boundary delineation in cluttered scenes.
129. 【2603.01305】AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Link: https://arxiv.org/abs/2603.01305
Authors: Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large multimodal models, exhibit strong task, task generalization capabilities, strong task generalization, Large multimodal
Comments:
Abstract:Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens ([SEG], [NOR], and [ANO]), establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
130. 【2603.01301】When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Link: https://arxiv.org/abs/2603.01301
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reinforcement learning, post-train medical Vision-Language, supervised fine-tuning, remains unclear, behaviors already induced
Comments:
Abstract:Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
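The contrast between Accuracy@1 and Pass@K that this abstract relies on can be made concrete with a short Python sketch. The abstract does not say how Pass@K is computed; the code below assumes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and the function names and toy counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled answers is correct,
    given n samples in total of which c were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def accuracy_at_1(n: int, c: int) -> float:
    """Fraction of single samples that are correct."""
    return c / n


# Toy case (hypothetical): 2 correct answers out of 16 samples.
acc1 = accuracy_at_1(16, 2)   # low single-shot accuracy
pass8 = pass_at_k(16, 2, 8)   # much higher chance within 8 attempts
```

A model with low Acc@1 but high Pass@K already has the correct answer in its support. This is the regime in which the abstract reports RL to be most effective, since RL mainly sharpens the output distribution.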
131. 【2603.01295】Multi-Level Bidirectional Decoder Interaction for Uncertainty-Aware Breast Ultrasound Analysis
Link: https://arxiv.org/abs/2603.01295
Authors: Abdullah Al Shafi, Md Kawsar Mahmud Khan Zunayed, Safin Ahmmed, Sk Imran Hossain, Engelbert Mephu Nguifo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: interpretation requires simultaneous, ultrasound interpretation requires, requires simultaneous lesion, simultaneous lesion segmentation, interpretation requires
Comments: 10 pages, 3 figures, 2 tables. The code is available at: [this https URL](https://github.com/C-loud-Nine/Uncertainty-Aware-Multi-Level-Decoder-Interaction)
Abstract:Breast ultrasound interpretation requires simultaneous lesion segmentation and tissue classification. However, conventional multi-task learning approaches suffer from task interference and rigid coordination strategies that fail to adapt to instance-specific prediction difficulty. We propose a multi-task framework addressing these limitations through multi-level decoder interaction and uncertainty-aware adaptive coordination. Task Interaction Modules operate at all decoder levels, establishing bidirectional segmentation-classification communication during spatial reconstruction through attention-weighted pooling and multiplicative modulation. Unlike prior single-level or encoder-only approaches, this multi-level design captures scale-specific task synergies across semantic-to-spatial scales, producing complementary task interaction streams. Uncertainty-Proxy Attention adaptively weights base versus enhanced features at each level using feature activation variance, enabling per-level and per-sample task balancing without heuristic tuning. To support instance-adaptive prediction, multi-scale context fusion captures morphological cues across varying lesion sizes. Evaluation on multiple publicly available breast ultrasound datasets demonstrates competitive performance, including 74.5% lesion IoU and 90.6% classification accuracy on the BUSI dataset. Ablation studies confirm that multi-level task interaction provides significant performance gains, validating that decoder-level bidirectional communication is more effective than conventional encoder-only parameter sharing. The code is available at: this https URL.
132. 【2603.01284】FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration
Link: https://arxiv.org/abs/2603.01284
Authors: Yizhou Huang, Gengze Jiang, Yihua Cheng, Kezhi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate trajectory prediction, safe autonomous driving, existing approaches struggle, balance modeling power, Accurate trajectory
Comments: Accepted by CVPR 2026
点击查看摘要
Abstract:Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. To address these limitations, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with O(N) complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5% and parameters by over 40%. Comprehensive ablations confirm the necessity of each component.
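The amplitude/phase decomposition mentioned in the abstract can be illustrated with a plain discrete Fourier transform. The sketch below is not the FoSS implementation: the function names and the sample coordinates are hypothetical, and the helix reordering and SSM modules are not reproduced. It only shows how a 1-D coordinate sequence splits into per-frequency amplitudes and phases.

```python
import cmath
import math

def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a numeric sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def amplitude_phase(signal):
    """Split a coordinate sequence into per-frequency amplitude and phase."""
    spectrum = dft(signal)
    amplitudes = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return amplitudes, phases

# Hypothetical x-coordinates of an agent moving at constant speed.
xs = [0.0, 1.0, 2.0, 3.0]
amps, phases = amplitude_phase(xs)
print(amps)
print(phases)
```

The zero-frequency amplitude is the sum of the coordinates, which is one way to see why low-frequency amplitudes carry the coarse, global shape of a trajectory.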
133. 【2603.01267】Certifiable Estimation with Factor Graphs
Link: https://arxiv.org/abs/2603.01267
Authors: Zhexin Xu,Nikolas R. Sanderson,Hanna Jiamei Zhang,David M. Rosen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: reusable building blocks, convenient modular modeling, modular modeling language, high-performance robotic state, state estimation systems
Comments:
Click to view abstract
Abstract:Factor graphs provide a convenient modular modeling language that enables practitioners to design and deploy high-performance robotic state estimation systems by composing simple, reusable building blocks. However, inference in these models is typically performed using local optimization methods that can converge to suboptimal solutions, a serious reliability concern in safety-critical applications. Conversely, certifiable estimators based on convex relaxation can recover verifiably globally optimal solutions in many practical settings, but the computational cost of solving their large-scale relaxations necessitates specialized, structure-exploiting solvers that require substantial expertise to implement, significantly hampering practical deployment. In this paper, we show that these two paradigms, which have thus far been treated as independent in the literature, can be naturally synthesized into a unified framework for certifiable factor graph optimization. The key insight is that factor graph structure is preserved under Shor's relaxation and Burer-Monteiro factorization: applying these transformations to a QCQP with an associated factor graph representation yields a lifted problem admitting a factor graph model with identical connectivity, in which variables and factors are simple one-to-one algebraic transformations of those in the original QCQP. This structural preservation enables the Riemannian Staircase methodology for certifiable estimation to be implemented using the same mature, highly-performant factor graph libraries and workflows already ubiquitously employed throughout robotics and computer vision, making certifiable estimation as straightforward to design and deploy as conventional factor graph inference.
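For readers unfamiliar with the two transformations named in the abstract, the following is a sketch of their standard textbook forms for a generic homogeneous equality-constrained QCQP. The symbols $Q$, $A_i$, $b_i$, $X$, $Y$, and $r$ are notation assumed here for illustration and are not taken from the paper.

```latex
% Original QCQP (generic homogeneous form, assumed for illustration)
\min_{x \in \mathbb{R}^{n}} \; x^{\top} Q x
\quad \text{s.t.} \quad x^{\top} A_i x = b_i, \qquad i = 1, \dots, m.

% Shor's relaxation: substitute X = x x^{\top} and drop the rank-one constraint
\min_{X \in \mathbb{S}^{n}} \; \operatorname{tr}(Q X)
\quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad X \succeq 0.

% Burer-Monteiro factorization: X = Y Y^{\top} with Y \in \mathbb{R}^{n \times r}
\min_{Y \in \mathbb{R}^{n \times r}} \; \operatorname{tr}(Q Y Y^{\top})
\quad \text{s.t.} \quad \operatorname{tr}(A_i Y Y^{\top}) = b_i.
```

In the factorized problem each scalar entry of $x$ is replaced by a row of $Y$ while $Q$ and the $A_i$ keep their sparsity pattern, which is consistent with the abstract's claim that the lifted problem has the same connectivity as the original factor graph.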
134. 【2603.01253】Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography
Link: https://arxiv.org/abs/2603.01253
Authors: Timofey Efimov,Singanallur Venkatakrishnan,Maliha Hossain,Haley Duba-Sullivan,Amirkoushyar Ziabari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: solving inverse problems, emerged as powerful, solving inverse, inverse problems, Diffusion models
Comments: Accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Click to view abstract
Abstract:Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.
135. 【2603.01250】The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
Link: https://arxiv.org/abs/2603.01250
Authors: Lidia Garrucho,Smriti Joshi,Kaisar Kushibar,Richard Osuala,Maciej Bobowicz,Xavier Bargalló,Paulius Jaruševičius,Kai Geissler,Raphael Schäfer,Muhammad Alberb,Tony Xu,Anne Martel,Daniel Sleiman,Navchetan Awasthi,Hadeel Awwad,Joan C. Vilanova,Robert Martí,Daan Schouten,Jeong Hoon Lee,Mirabela Rusu,Eleonora Poeta,Luisa Vargas,Eliana Pastor,Maria A. Zuluaga,Jessica Kächele,Dimitrios Bounias,Alexandra Ertl,Katarzyna Gwoździewicz,Maria-Laura Cosaka,Pasant M. Abo-Elhoda,Sara W. Tantawy,Shorouq S. Sakrana,Norhan O. Shawky-Abdelfatah,Amr Muhammad Abdo-Salem,Androniki Kozana,Eugen Divjak,Gordana Ivanac,Katerina Nikiforaki,Michail E. Klontzas,Rosa García-Dosdá,Meltem Gulsun-Akpinar,Oğuz Lafcı,Carlos Martín-Isla,Oliver Díaz,Laura Igual,Karim Lekadir
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: frequently diagnosed malignancy, magnetic resonance imaging, cancer-related mortality, magnetic resonance, frequently diagnosed
Comments:
Click to view abstract
Abstract:Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
136. 【2603.01236】AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Link: https://arxiv.org/abs/2603.01236
Authors: Changwoo Baek,Jouwon Song,Sohyeon Kim,Kyeongbo Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, mitigate substantial computational, substantial computational overhead, computational overhead incurred, Large Vision-Language
Comments: Accepted to ICLR 2026
Click to view abstract
Abstract:Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at this https URL.
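The two diagnostics named in the abstract can be sketched in a few lines. The code below assumes the common definition of effective rank, the exponential of the Shannon entropy of the normalized singular values; the paper may define or compute it differently. To stay dependency-free it takes the singular values as input rather than computing an SVD, and all function names are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (natural log) of a probability vector; zeros are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values]
    return math.exp(shannon_entropy(probs))

def attention_entropy(scores):
    """Entropy of non-negative attention scores after normalizing them."""
    total = sum(scores)
    return shannon_entropy([s / total for s in scores])

# Evenly spread singular values give full effective rank; a single dominant one gives 1.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([1.0, 0.0, 0.0, 0.0]))
# Concentrated attention has low entropy, spread-out attention has high entropy.
print(attention_entropy([0.97, 0.01, 0.01, 0.01]))
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))
```

Under this reading, a low attention entropy would correspond to the "simple image" case where evidence is concentrated, and a high one to the "complex image" case with distributed features.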
137. 【2603.01228】Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Link: https://arxiv.org/abs/2603.01228
Authors: Caiyong Piao,Zhiyuan Yan,Haoming Xu,Yunzhen Zhao,Kaiqing Lin,Feiyang Xu,Shuigeng Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: harmful visual content, application scenarios, harmful visual, rejection of sensitive, visual content
Comments:
Click to view abstract
Abstract:Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
138. 【2603.01224】Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
Link: https://arxiv.org/abs/2603.01224
Authors: Ari Wahl,Dorian Gawlinski,David Przewozny,Paul Chojecki,Felix Bießmann,Sebastian Bosse
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: Pre-trained general-purpose Vision-Language, enhance intuitive human-machine, intuitive human-machine interactions, human-machine interactions due, rich world knowledge
Comments: Accepted at Workshop on Integrating Image Processing with Large-Scale Language/Vision Models for Advanced Visual Understanding (LVLM) at IEEE International Conference on Image Processing (ICIP) 2025
Click to view abstract
Abstract:Pre-trained general-purpose Vision-Language Models (VLMs) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinate detection tasks are rare. In this work, we investigate the interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
139. 【2603.01205】CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling
Link: https://arxiv.org/abs/2603.01205
Authors: Li Jin,Weikai Chen,Yujie Wang,Yingda Yin,Zeyu Hu,Runze Zhang,Keyang Luo,Shengju Qian,Xin Wang,Xueying Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: input sensor coordinates, segmentation remains brittle, sensor coordinates, remains brittle, canonical
Comments:
Click to view abstract
Abstract:Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
140. 【2603.01195】VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning
Link: https://arxiv.org/abs/2603.01195
Authors: Mingkang Dong,Hongyi Cai,Jie Li,Sifan Zhou,Bin Ren,Kunyu Peng,Yuqian Fu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: require visual reasoning, samples genuinely require, genuinely require visual, genuinely require, instruction tuning depends
Comments: 17 pages, 4 figures
Click to view abstract
Abstract:The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.
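The selection idea described in the abstract, comparing loss with and without the image and then picking high-necessity samples inside each semantic cluster, might look roughly like the sketch below. The scoring formula (a plain loss difference), the field names, the sample data, and `keep_ratio` are all assumptions made for illustration; the paper's exact score may differ.

```python
import math
from collections import defaultdict

def visual_necessity(loss_text_only, loss_with_image):
    """Assumed score: how much the predictive loss drops when the image is shown."""
    return loss_text_only - loss_with_image

def select_per_cluster(samples, keep_ratio):
    """Keep the top `keep_ratio` fraction by necessity score within each cluster."""
    by_cluster = defaultdict(list)
    for s in samples:
        by_cluster[s["cluster"]].append(s)
    selected = []
    for members in by_cluster.values():
        ranked = sorted(
            members,
            key=lambda s: visual_necessity(s["loss_text"], s["loss_vis"]),
            reverse=True,
        )
        k = max(1, math.ceil(keep_ratio * len(ranked)))
        selected.extend(ranked[:k])
    return selected

# Hypothetical samples: a large positive score suggests a vision-critical sample,
# a score near zero a visually redundant one, and a negative score a misaligned one.
samples = [
    {"id": "a", "cluster": 0, "loss_text": 2.0, "loss_vis": 0.5},
    {"id": "b", "cluster": 0, "loss_text": 1.0, "loss_vis": 1.0},
    {"id": "c", "cluster": 1, "loss_text": 1.0, "loss_vis": 1.8},
    {"id": "d", "cluster": 1, "loss_text": 2.5, "loss_vis": 1.5},
]
picked = select_per_cluster(samples, keep_ratio=0.5)
print([s["id"] for s in picked])
```

Selecting within clusters rather than globally is what keeps one task type from dominating the retained subset.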
141. 【2603.01194】RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
Link: https://arxiv.org/abs/2603.01194
Authors: Mochu Xiang,Zhelun Shen,Xuesong Li,Jiahui Ren,Jing Zhang,Chen Zhao,Shanshan Liu,Haocheng Feng,Jingdong Wang,Yuchao Dai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Human perceive, limited viewpoints, Human, reconstruction, Abstract
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: this https URL
142. 【2603.01174】VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification
Link: https://arxiv.org/abs/2603.01174
Authors: Abdellah Zakaria Sellam,Fadi Abdeladhim Zidi,Salah Eddine Bekhouche,Ihssen Houhou,Marouane Tliba,Cosimo Distante,Abdenour Hadid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-dimensional spectral data, Accurate classification, hyperspectral imagery, tension between high-dimensional, extreme scarcity
Comments:
Click to view abstract
Abstract:Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
143. 【2603.01169】TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Link: https://arxiv.org/abs/2603.01169
Authors: Sumin Kim,Hyemin Jeong,Mingu Kang,Yejin Kim,Yoori Oh,Joonseok Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: content necessitates effective, efficiently extract key, extract key information, video content necessitates, necessitates effective video
Comments: Published as a Conference Paper at ICLR 2026
Click to view abstract
Abstract:The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at this https URL.
144. 【2603.01164】FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
Link: https://arxiv.org/abs/2603.01164
Authors: Maomao Li,Yunfei Liu,Yu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: propagate edit contents, Image-driven video editing, source video, video editing aims, video
Comments: 13 pages
Click to view abstract
Abstract:Image-driven video editing aims to propagate edited content from the modified first frame to the remaining frames. Existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process using the edited first frame. A popular choice for maintaining motion and layout from the source video is to intervene in the denoising process by injecting attention during reconstruction. However, such injection often leads to unsatisfactory results: excessive injection introduces conflicting semantics from the source video, while insufficient injection brings limited source representation. Recognizing this, we propose an Editing-awaRE (REE) injection method to modulate the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frame to form a corresponding editing mask. Next, we track the editing area throughout the entire video by using optical flow to warp the first-frame mask. Then, editing-aware feature injection intensity for each token is generated accordingly, where injection is not conducted on editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerging rectified-flow models, dubbed FREE-Edit. Without fine-tuning or training, our FREE-Edit demonstrates effectiveness in various image-driven video editing scenarios, showing its capability to produce higher-quality outputs compared with existing techniques. Project page: this https URL.
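The first and third steps of REE injection as described in the abstract, building an editing mask from the first-frame pixel difference and switching injection off inside it, can be sketched on toy grayscale frames. The threshold value, the function names, and the binary 0/1 weights are assumptions for illustration only; the optical-flow warping that propagates the mask to later frames is omitted.

```python
def editing_mask(source, edited, threshold=0.1):
    """Binary mask: 1 where the edited first frame differs from the source frame."""
    return [
        [1 if abs(s - e) > threshold else 0 for s, e in zip(src_row, ed_row)]
        for src_row, ed_row in zip(source, edited)
    ]

def injection_weights(mask):
    """Per-pixel injection intensity: no injection (0.0) inside edited regions."""
    return [[0.0 if m else 1.0 for m in row] for row in mask]

# Hypothetical 2x3 grayscale first frames with two edited pixels.
source = [[0.2, 0.2, 0.2], [0.5, 0.5, 0.5]]
edited = [[0.2, 0.9, 0.2], [0.5, 0.5, 0.1]]
mask = editing_mask(source, edited)
weights = injection_weights(mask)
print(mask)
print(weights)
```

Source features would then be injected only where the weight is non-zero, so edited regions are free to follow the edited first frame.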
145. 【2603.01163】BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
Link: https://arxiv.org/abs/2603.01163
Authors: Jiachen Yang,Xianhui Lin,Yi Dong,Zebiao Zheng,Xing Liu,Hong Gu,Yanmei Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: facial identity features, preserving unique facial, unique facial identity, retouching requires removing, requires removing subtle
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
146. 【2603.01161】GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection
Link: https://arxiv.org/abs/2603.01161
Authors: Durgesh Ameta,Ujjwal Mishra,Praful Hambarde,Amit Shukla
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: identify semantic differences, Selective State Space, State Space Models, aims to identify, identify semantic
Comments: This work has been submitted to the IEEE for possible publication
Click to view abstract
Abstract:Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with an Adaptive Feature Relevance and Refinement (AFRAR) module, fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively; the latter generates multiple softmax maps to capture important features while suppressing irrelevant ones. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former's superior performance compared to existing approaches. Notably, GRAD-Former outperforms the current state-of-the-art models across all the metrics and all the datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: this https URL
147. 【2603.01151】D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
Link: https://arxiv.org/abs/2603.01151
Authors: Haozhe Lou,Mingtong Zhang,Haoran Geng,Hanyang Zhou,Sicheng He,Zhiyuan Gao,Siheng Zhao,Jiageng Mao,Pieter Abbeel,Jitendra Malik,Daniel Seita,Yue Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: develop robotic systems, robotic systems, cost-effective and flexible, flexible platform, develop robotic
Comments: ICLR 2026 Poster
Click to view abstract
Abstract:Simulation provides a cost-effective and flexible platform for data generation and policy learning to develop robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals while simultaneously enabling grasping policy learning. By optimizing the mass of the manipulated object, our method automatically builds high-fidelity and physically plausible digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. These optimized mass values facilitate force-aware policy learning, achieving strong performance in object grasping and effectively reducing the sim-to-real gap.
148. 【2603.01147】ConVibNet: Needle Detection during Continuous Insertion via Frequency-Inspired Features
Link: https://arxiv.org/abs/2603.01147
Authors: Jiamei Guo,Zhehao Duan,Maria Neiiendam,Dianye Huang,Nassir Navab,Zhongliang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: success critically depends, accurate needle placement, clinical practice, success critically, critically depends
Comments: Accepted by IPCAI
Click to view abstract
Abstract:Purpose: Ultrasound-guided needle interventions are widely used in clinical practice, but their success critically depends on accurate needle placement, which is frequently hindered by the poor and intermittent visibility of needles in ultrasound images. Existing approaches remain limited by artifacts, occlusions, and low contrast, and often fail to support real-time continuous insertion. To overcome these challenges, this study introduces a robust real-time framework for continuous needle detection. Methods: We present ConVibNet, an extension of VibNet for detecting needles with significantly reduced visibility, addressing real-time, continuous needle tracking during insertion. ConVibNet leverages temporal dependencies across successive ultrasound frames to enable continuous estimation of both needle tip position and shaft angle in dynamic scenarios. To strengthen temporal awareness of needle-tip motion, we introduce a novel intersection-and-difference loss that explicitly leverages motion correlations across consecutive frames. In addition, we curated a dedicated dataset for model development and evaluation. Results: The performance of the proposed ConVibNet model was evaluated on our dataset, demonstrating superior accuracy compared to the baseline VibNet and UNet-LSTM models. Specifically, ConVibNet achieved a tip error of 2.80 ± 2.42 mm and an angle error of 1.69 ± 2.00 deg. These results represent a 0.75 mm improvement in tip localization accuracy over the best-performing baseline, while preserving real-time inference capability. Conclusion: ConVibNet advances real-time needle detection in ultrasound-guided interventions by integrating temporal correlation modeling with a novel intersection-and-difference loss, thereby improving accuracy and robustness and demonstrating high potential for integration into autonomous insertion systems.
Submission history: [v1] Sun, 1 Mar 2026 15:16:25 UTC (992 KB), from Dianye Huang
149. 【2603.01143】TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning
链接:https://arxiv.org/abs/2603.01143
作者:Zhuo Chen,Shawn Young,Lijian Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:holds great promise, critical computational bottleneck, computational pathology holds, large vision-language models, pathology holds great
备注: 8 pages, 4 figures, 2 tables
点击查看摘要
Abstract:The application of large vision-language models to computational pathology holds great promise for diagnostic assistants but faces a critical computational bottleneck: the gigapixel scale of Whole Slide Images (WSIs). A single WSI typically contains over $10^5$ patches, creating sequence lengths that exceed the constraints of standard Transformer architectures. Existing solutions often resort to spatial sampling, which risks discarding diagnostically critical evidence. To address this, we propose TC-SSA (Token Compression via Semantic Slot Aggregation), a learnable token compression framework that aggregates patch features into a fixed number of semantic slots. A gated routing module assigns patches to slots using sparse Top-2 routing, followed by weighted aggregation, enabling global slide coverage under a strict token budget. The resulting representation retains diagnostically relevant information while reducing the number of visual tokens to 1.7% of the original sequence. On SlideBench (TCGA), our model achieves 78.34% overall accuracy and 77.14% on the diagnosis subset, outperforming sampling-based baselines under comparable token budgets. The method also generalizes to MIL classification, reaching an AUC of 95.83% on TCGA-BRCA, 98.27% on TCGA-NSCLC, and 79.80% on PANDA. These results suggest that learnable semantic aggregation provides an effective trade-off between efficiency and diagnostic performance for gigapixel pathology reasoning.
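The routing step described in this abstract (sparse Top-2 assignment of patches to slots, followed by weighted aggregation) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product gate, the `slot_keys` parameter, and the function name `top2_slot_aggregate` are assumptions made purely for illustration, and a real system would use learned gates over tensor features.

```python
import math

def top2_slot_aggregate(patches, slot_keys):
    """Aggregate patch features into len(slot_keys) slots via sparse top-2 routing.

    patches:   list of feature vectors (lists of floats), one per patch
    slot_keys: hypothetical per-slot gate vectors (assumed, not from the paper)
    """
    dim = len(patches[0])
    slots = [[0.0] * dim for _ in slot_keys]
    weight_sums = [0.0] * len(slot_keys)
    for feat in patches:
        # Gate logits: dot product between the patch and each slot key.
        logits = [sum(f * k for f, k in zip(feat, key)) for key in slot_keys]
        # Sparse routing: keep only the two highest-scoring slots.
        top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
        # Softmax restricted to the two selected logits.
        m = max(logits[i] for i in top2)
        exps = {i: math.exp(logits[i] - m) for i in top2}
        z = sum(exps.values())
        for i in top2:
            w = exps[i] / z
            weight_sums[i] += w
            for d in range(dim):
                slots[i][d] += w * feat[d]
    # Weighted mean per slot; slots that received no patch stay at zero.
    return [
        [v / ws for v in slot] if ws > 0 else slot
        for slot, ws in zip(slots, weight_sums)
    ]

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slot_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
compressed = top2_slot_aggregate(patches, slot_keys)
```

Whatever the number of patches, the output always has one vector per slot, which is what keeps the token budget fixed.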
150. 【2603.01142】ArtLLM: Generating Articulated Assets via 3D LLM
链接:https://arxiv.org/abs/2603.01142
作者:Penghao Wang,Siyuan Xie,Hongyu Yan,Xianghui Yang,Jingwei Huang,Chunchao Guo,Jiayuan Gu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Creating interactive digital, Creating interactive, interactive digital environments, environments for gaming, simulation relies
备注:
点击查看摘要
Abstract:Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.
151. 【2603.01140】Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers
链接:https://arxiv.org/abs/2603.01140
作者:Kuai Jiang,Zhaoyan Ding,Guijuan Zhang,Dianjie Lu,Zhuoran Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:inadvertently learn spurious, learn spurious correlations, Conventional image denoising, Conventional image, inadvertently learn
备注:
点击查看摘要
Abstract:Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
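As a rough illustration of the orthogonality constraint mentioned in component (2), the sketch below penalizes the squared cosine similarity between a content vector and a noise vector. The abstract does not give the exact loss, so this particular form and the name `orthogonality_penalty` are assumptions, not the TCD-Net formulation.

```python
import math

def orthogonality_penalty(content, noise, eps=1e-12):
    """Squared cosine similarity between two feature vectors.

    Returns 0 when the vectors are orthogonal and approaches 1 when they
    are parallel. This is an assumed form of the constraint.
    """
    dot = sum(c * n for c, n in zip(content, noise))
    norm_c = math.sqrt(sum(c * c for c in content))
    norm_n = math.sqrt(sum(n * n for n in noise))
    cos = dot / (norm_c * norm_n + eps)
    return cos * cos

# Orthogonal representations incur no penalty; aligned ones are penalized.
no_leak = orthogonality_penalty([1.0, 0.0], [0.0, 3.0])
full_leak = orthogonality_penalty([1.0, 2.0], [2.0, 4.0])
```

Minimizing such a term alongside the reconstruction loss pushes the two branches toward carrying non-overlapping information.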
152. 【2603.01125】Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
链接:https://arxiv.org/abs/2603.01125
作者:Chengtai Li,Yuting He,Jianfeng Ren,Ruibin Bai,Yitian Zhao,Heng Yu,Xudong Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Augmented Anomaly Contrastive, Anomaly Contrastive Learning, received significant attention, compositional visual relations, Anomaly Contrastive
备注: Accepted by IEEE Transactions on Multimedia
点击查看摘要
Abstract:While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL), i.e., to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.
153. 【2603.01124】ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
链接:https://arxiv.org/abs/2603.01124
作者:Xiwei Liu,Yulong Li,Xinlin Zhuang,Xuhui Li,Jianxu Chen,Haolin Yang,Imran Razzak,Yutong Xie
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:localized pathological evidence, shown promising potential, clinical decision support, factual hallucinations due, Medical Vision-Language Models
备注:
点击查看摘要
Abstract:Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Multiple Med-LLMs evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
154. 【2603.01116】Improved MambdaBDA Framework for Robust Building Damage Assessment Across Disaster Domains
链接:https://arxiv.org/abs/2603.01116
作者:Alp Eren Gençoğlu,Hazım Kemal Ekenel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Reliable post-disaster building, building damage assessment, post-disaster building damage, Reliable post-disaster, severe class imbalance
备注:
点击查看摘要
Abstract:Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve MambaBDA, the BDA network of the ChangeMamba architecture and one of the most successful BDA models. The approach enhances MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance in damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and cross-dataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.
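For readers unfamiliar with component (i), Focal Loss in its standard form is FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below computes it for a single sample; the default values of `gamma` and `alpha` are illustrative assumptions, not settings reported for this paper.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample.

    p_true: predicted probability assigned to the sample's true class
    gamma:  focusing parameter; gamma = 0 recovers (alpha-weighted) cross-entropy
    alpha:  class weighting factor
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# An easy example (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy = focal_loss(0.9)
hard = focal_loss(0.1)
```

Because well-classified samples contribute little, the abundant "no damage" pixels dominate the gradient less than they would under plain cross-entropy.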
155. 【2603.01115】GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation
链接:https://arxiv.org/abs/2603.01115
作者:Zhuonan Liang,Wei Guo,Jie Gan,Yaxuan Song,Runnan Chen,Hang Chang,Weidong Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical image analysis, increasingly adopted, image analysis, medical image, medical
备注: 12 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions the native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of dedicated medical architectures. Training relies on a guide supervision loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at this https URL
156. 【2603.01111】DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
链接:https://arxiv.org/abs/2603.01111
作者:Yiming Ma,Hongkun Yang,Lionel Z. Wang,Bin Chen,Weizhi Xian,Jianzhi Teng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:adapting pre-trained Vision-Language, Prompt learning, pre-trained Vision-Language Models, dominant paradigm, paradigm for adapting
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can then degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
157. 【2603.01108】GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation
链接:https://arxiv.org/abs/2603.01108
作者:Tajamul Ashraf,Abrar Ul Riyaz,Wasif Tak,Tavaheed Tariq,Sonia Yadav,Moloud Abdar,Janibul Bashir
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:context-aware intraoperative assistance, instrument handoff guidance, workflow-aware robotic support, Clinically reliable perception, collision avoidance
备注: [this https URL](https://github.com/gaash-lab/GroundedSurg)
点击查看摘要
Abstract:Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation models and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at this https URL
158. 【2603.01104】Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI
链接:https://arxiv.org/abs/2603.01104
作者:Sicheng Yang,Yukai Huang,Weitong Cai,Shitong Sun,Fengyi Fang,You He,Yiqiao Xie,Jiankang Deng,Hang Zhang,Jifei Song,Zhensong Zhang
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:smart glasses, require a screen, stable desk, free hands, Large Language Model
备注: 14 pages, 6 figures, WWW 2026
点击查看摘要
Abstract:What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model's context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on Egolife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
159. 【2603.01103】Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting
链接:https://arxiv.org/abs/2603.01103
作者:Dantong Qin,Alessandro Bozzon,Xian Yang,Xun Zhang,Yike Guo,Pan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural image data, creative multimedia systems, systems are built, difficult to collect, collect at scale
备注:
点击查看摘要
Abstract:Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.
160. 【2603.01099】HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
链接:https://arxiv.org/abs/2603.01099
作者:Jiashu Li,Xumeng Han,Zhaoyang Wei,Zipeng Wang,Kuiran Wang,Guorong Li,Zhenjun Han,Jianbin Jiao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, combining photorealistic rendering, view synthesis, combining photorealistic, real-time efficiency
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS, Hierarchical Guidance for Robust 3D Gaussian Splatting, a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions. The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality. Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.
161. 【2603.01098】Differential privacy representation geometry for medical image analysis
链接:https://arxiv.org/abs/2603.01098
作者:Soroosh Tayebi Arasteh,Marziyeh Mohammadi,Sven Nebelung,Daniel Truhn
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Differential Privacy Representation, utility loss unclear, medical imaging, introduce Differential Privacy, Differential privacy
备注:
点击查看摘要
Abstract:The effect of differential privacy (DP) in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
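The two kinds of quantity named in this abstract can be made concrete with a small sketch. The abstract does not define "spectral effective dimension", so the participation ratio used below, (Σλ)² / Σλ², is an assumed stand-in (one common definition), and both function names are hypothetical. The utilization gap follows the abstract's wording: linear-probe utility minus end-to-end utility.

```python
def effective_dimension(eigenvalues):
    """Participation ratio of a covariance spectrum: (sum λ)^2 / sum λ^2.

    Equals the number of eigenvalues when they are all equal, and 1 when a
    single direction carries all the variance. Assumed definition, not
    necessarily the one used in DP-RGMI.
    """
    total = sum(eigenvalues)
    sq = sum(v * v for v in eigenvalues)
    return total * total / sq if sq > 0 else 0.0

def utilization_gap(linear_probe_utility, end_to_end_utility):
    """Gap between what a linear probe recovers and what the trained head achieves."""
    return linear_probe_utility - end_to_end_utility

isotropic = effective_dimension([1.0, 1.0, 1.0, 1.0])   # all directions equally used
collapsed = effective_dimension([1.0, 0.0, 0.0, 0.0])   # one dominant direction
```

A positive utilization gap with a largely unchanged effective dimension would correspond to the case the abstract highlights: features remain linearly separable, but the privately trained task head fails to exploit them.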
162. 【2603.01096】Unified Vision-Language Modeling via Concept Space Alignment
链接:https://arxiv.org/abs/2603.01096
作者:Yifu Qiu,Paul-Ambroise Duquenne,Holger Schwenk
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Omnilingual Embeddings Team, embedding space extended, Omnilingual Embeddings, embedding space SONAR, V-SONAR
备注: ICLR 2026
点击查看摘要
Abstract:We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, which range from rich- to low-resource.
163. 【2603.01083】Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective
链接:https://arxiv.org/abs/2603.01083
作者:Arctanx An,Shizhao Sun,Danqing Huang,Mingxi Cheng,Yan Gao,Ji Li,Yu Qiao,Jiang Bian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision language models, visual communication, central to visual, remains underexplored, underexplored in vision
备注: ICLR 2026
点击查看摘要
Abstract:Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design this http URL, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: this https URL
164. 【2603.01082】Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
链接:https://arxiv.org/abs/2603.01082
作者:Xuan Lu,Kangle Li,Haohang Huang,Rui Meng,Wenjun Zeng,Xiaoyu Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Recent advances, multimodal large language, large language models, enabling systems, large language
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at this https URL
165. 【2603.01074】Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
Link: https://arxiv.org/abs/2603.01074
Authors: Wangkai Li, Zhaoyang Li, Yuwen Pan, Rui Sun, Yujia Chen, Tianzhu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: conditions significantly degrade, introducing large distribution, weather conditions significantly, point cloud semantic, large distribution shifts
Comments: Accepted by International Conference on Learning Representations (ICLR 2026)
Abstract:Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation-based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade-off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation-aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state-of-the-art results.
166. 【2603.01073】Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
Link: https://arxiv.org/abs/2603.01073
Authors: Yunguan Fu, Wenjia Bai, Wen Yan, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion-based unsupervised image, expensive multi-step inference, multi-step inference limits, inference limits practical, unsupervised image registration
Comments:
Abstract:Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at this https URL.
167. 【2603.01069】SHIELD8-UAV: Sequential 8-bit Hardware Implementation of a Precision-Aware 1D-F-CNN for Low-Energy UAV Acoustic Detection and Temporal Tracking
Link: https://arxiv.org/abs/2603.01069
Authors: Susmita Ghanta, Karan Nathwani, Rohit Chaurasiya
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Numerical Analysis (math.NA)
Keywords: Real-time unmanned aerial, unmanned aerial vehicle, Real-time unmanned, edge demands low-latency, demands low-latency inference
Comments: Preprint of work submitted to ISVLSI 2026
Abstract:Real-time unmanned aerial vehicle (UAV) acoustic detection at the edge demands low-latency inference under strict power and hardware limits. This paper presents SHIELD8-UAV, a sequential 8-bit hardware implementation of a precision-aware 1D feature-driven CNN (1D-F-CNN) accelerator for continuous acoustic monitoring. The design performs layer-wise execution on a shared multi-precision datapath, eliminating the need for replicated processing elements. A layer-sensitivity quantisation framework supports FP32, BF16, INT8, and FXP8 modes, while structured channel pruning reduces the flattened feature dimension from 35,072 to 8,704 (75%), thereby lowering serialised dense-layer cycles. The model achieves 89.91% detection accuracy in FP32 with less than 2.5% degradation in 8-bit modes. The accelerator uses 2,268 LUTs and 0.94 W power with 116 ms end-to-end latency, achieving 37.8% and 49.6% latency reduction compared with QuantMAC and LPRE, respectively, on a Pynq-Z2 FPGA, and 5-9% lower logic usage than parallel designs. ASIC synthesis in UMC 40 nm technology shows a maximum operating frequency of 1.56 GHz, 3.29 mm² core area, and 1.65 W total power. These results demonstrate that sequential execution combined with precision-aware quantisation and serialisation-aware pruning enables practical low-energy edge inference without relying on massive parallelism.
168. 【2603.01068】LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
Link: https://arxiv.org/abs/2603.01068
Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: length-adaptive omni diffusion, effective and length-adaptive, diffusion, multimodal understanding, omni diffusion
Comments:
Abstract:We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at this https URL.
169. 【2603.01063】Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Link: https://arxiv.org/abs/2603.01063
Authors: Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, Fuxi Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: plateau during Reinforcement, Reinforcement Learning, previous Supervised Fine-Tuning, Reinforcement, autonomous driving
Comments:
Abstract:Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause -- whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
170. 【2603.01050】MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
Link: https://arxiv.org/abs/2603.01050
Authors: Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: cross-modal information synthesis, search, multi-tool invocation, research agent capable, reasoning and planning
Comments: Technical report
Abstract:We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and optimizes a specialized search tool expert for each tool. It then recomposes tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at this https URL
171. 【2603.01038】From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Link: https://arxiv.org/abs/2603.01038
Authors: Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan, Siran Peng, Tianshuo Zhang, Xiao Tan, Kunbin Chen, Wei He, Jingdong Wang, Ajian Liu, Xiangyu Zhu, Zhen Lei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: robust Face Anti-Spoofing, Face recognition remains, recognition remains vulnerable, Face Anti-Spoofing, Face recognition
Comments: Keywords: Biometrics, Face Anti-Spoofing, MLLM
Abstract:Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
172. 【2603.01036】SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
Link: https://arxiv.org/abs/2603.01036
Authors: Kuanxu Hou
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: efficiency directly determine, robot automated assembly, production quality, robot automated, efficiency directly
Comments: snap assembly, snap detection and localization, object detection, multi-scale feature fusion, self-attention
Abstract:In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolution for dimension unification while preserving resolution; an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations integrating details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms traditional Faster R-CNN significantly: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5% respectively. This fully demonstrates the method's superiority in complex snap detection and localization tasks.
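For readers unfamiliar with the IoU metric quoted above, the following is a minimal sketch of how Intersection over Union is commonly computed for two axis-aligned boxes. The function name `box_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration; the abstract does not say how the paper's evaluation code represents boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not taken from the SMR-Net paper.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.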
173. 【2603.01034】Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Link: https://arxiv.org/abs/2603.01034
Authors: Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: discrete forms defined, Implicit Neural Representations, Tensor Ring, high-order data modeling, powerful tool
Comments: 22 pages, 18 figures, 12 tables. Accepted by CVPR 2026
Abstract:Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches. Code is available at this https URL.
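As background for the abstract above, here is a minimal sketch of the standard discrete Tensor Ring format: each tensor entry is the trace of a product of one matrix slice per mode. The names `matmul` and `tr_element`, and the nested-list core layout, are assumptions for illustration. The paper's functional variant replaces the discrete slices with INR-parameterized functions of continuous coordinates, which this sketch does not attempt to reproduce.

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested-list matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def tr_element(cores, index):
    """Entry X[i_1, ..., i_N] of a tensor stored in discrete Tensor Ring form.

    cores[k][i] is assumed to be the r_k x r_{k+1} matrix slice of the k-th
    core at index i, with the last rank equal to the first so that the
    product is square. The entry is the trace of the product of the slices.
    """
    prod = cores[0][index[0]]
    for core, i in zip(cores[1:], index[1:]):
        prod = matmul(prod, core[i])
    return sum(prod[d][d] for d in range(len(prod)))
```

Because the trace is invariant under cyclic shifts, rotating the cores and indices together leaves the entry unchanged, which is the "ring" in the name.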
174. 【2603.01029】Vision-Language Feature Alignment for Road Anomaly Segmentation
Link: https://arxiv.org/abs/2603.01029
Authors: Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Safe autonomous systems, identify unknown obstacles, complex environments require, environments require robust, require robust road
Comments:
Abstract:Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and this http URL is released on this https URL.
175. 【2603.01028】Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Link: https://arxiv.org/abs/2603.01028
Authors: Junbo Ke, Yangyang Xu, You-Wei Wen, Chao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Implicit Neural Representations, Implicit Neural, signal processing tasks, capture high-frequency details, inherent spectral bias
Comments: 21 pages, 22 figures, 8 tables
Abstract:Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods. Our code is available at this https URL.
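One way to see why an element-wise (Hadamard) product of Fourier-feature branches can reach frequencies outside a fixed basis is the product-to-sum identity sin(ax)·sin(bx) = ½[cos((a−b)x) − cos((a+b)x)]: multiplying two sinusoids yields their sum and difference frequencies. The sketch below only checks this generic trigonometric fact numerically. The function names are hypothetical, and nothing here is taken from the CAFE implementation.

```python
import math


def hadamard_of_sines(x, a, b):
    """Product of two sinusoidal features with frequencies a and b."""
    return math.sin(a * x) * math.sin(b * x)


def sum_difference_form(x, a, b):
    """The same signal written with the frequencies (a - b) and (a + b)."""
    return 0.5 * (math.cos((a - b) * x) - math.cos((a + b) * x))


def max_gap(a, b, samples=50):
    """Largest absolute difference between the two forms on [0, 1)."""
    return max(
        abs(hadamard_of_sines(i / samples, a, b) - sum_difference_form(i / samples, a, b))
        for i in range(samples)
    )
```

With base frequencies 3 and 5, for example, the product contains components at 2 and 8, neither of which is in the original pair.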
176. 【2603.01026】RaUF: Learning the Spatial Uncertainty Field of Radar
Link: https://arxiv.org/abs/2603.01026
Authors: Shengpeng Wang, Kuangyu Wang, Wei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe azimuth ambiguity, Millimeter-wave radar offers, offers unique advantages, clutter-induced spurious returns, radar offers unique
Comments:
Abstract:Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios.
177. 【2603.01016】Implementation of Licensed Plate Detection and Noise Removal in Image Processing
Link: https://arxiv.org/abs/2603.01016
Authors: Yiquan Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Keywords: Car license plate, license plate recognition, license plate, Car license, plate recognition
Comments: 13 pages. This is the author's version, accepted manuscript
Abstract:A car license plate recognition system is an image processing technology that identifies vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, the number of vehicles on the road is increasing rapidly, which has created considerable demand for car license plate recognition systems. Such systems can be deployed in electronic parking payment, highway toll collection, traffic surveillance, and police enforcement. Car license plate recognition also has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.
178. 【2603.01010】GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Link: https://arxiv.org/abs/2603.01010
Authors: Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: viewpoints remains challenging, Recent advances, remains challenging, advances in generative, generative modeling
Comments: Accepted by CVPR 2026
Abstract:Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
179. 【2603.01007】Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Link: https://arxiv.org/abs/2603.01007
Authors: Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Wen Yang, Huai Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving perception, offering comprehensive geometric, comprehensive geometric scene, geometric scene understanding, driving perception
Comments: 10 pages, 6 figures. Accepted at CVPR 2026
Abstract:3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to the lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
180. 【2603.01000】Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
Link: https://arxiv.org/abs/2603.01000
Authors: Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing methods largely, methods largely focus, distinct motion patterns, controllable video generation, require distinct motion
Comments: 15 pages, 11 figures, see [this https URL](https://ethan-li123.github.io/FlexiMMT_page/)
Abstract:Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
181. 【2603.00990】MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation
Link: https://arxiv.org/abs/2603.00990
Authors: Yi Zhang, Puxun Tu, Kun Wang, Yulin Yan, Tao Ying, Xiaojun Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: marker-based systems demand, demand prohibitive costs, intrusive sensor attachment, severe cumulative drift, systems demand prohibitive
Comments: 10 pages, 4 figures
Abstract:Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.
182. 【2603.00988】Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality
Link: https://arxiv.org/abs/2603.00988
Authors: Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, Jocelyn Chanussot
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Keywords: Foundation models, Remote sensing, models, Foundation, techniques are increasingly
Comments: Accepted by IEEE GRSM
Abstract:Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
183. 【2603.00985】The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers
Link: https://arxiv.org/abs/2603.00985
Authors: Jiaqi Tang, Weixuan Xu, Shu Zhang, Fandong Zhang, Qingchao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Transformers, data-hungry nature clashes, clinical archives, data-hungry nature, nature clashes
Comments:
Abstract:Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.
184. 【2603.00983】Event-Anchored Frame Selection for Effective Long-Video Understanding
Link: https://arxiv.org/abs/2603.00983
Authors: Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Massive frame redundancy, large vision-language models, limited context window, context window make, window make efficient
Notes:
Abstract:Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
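The global refinement stage described above relies on Maximal Marginal Relevance. A minimal sketch of plain greedy MMR over frame embeddings follows, assuming cosine similarity and a fixed trade-off weight `lam`; the abstract does not specify the adaptive scheme EFS uses, so `mmr_select`, `lam`, and the `anchors` argument are hypothetical names for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(frames, query, k, lam=0.7, anchors=()):
    """Greedy MMR: pick k frame indices balancing relevance and diversity.

    frames  : list of embedding vectors, one per frame
    query   : embedding vector of the text query
    anchors : indices kept up front (assumption: stands in for event anchors)
    lam     : fixed weight on relevance (assumption: the paper's scheme is adaptive)
    """
    selected = list(dict.fromkeys(anchors))[:k]
    relevance = [cosine(f, query) for f in frames]
    candidates = [i for i in range(len(frames)) if i not in selected]

    def score(i):
        # Redundancy is the highest similarity to any frame already chosen.
        redundancy = max(
            (cosine(frames[i], frames[j]) for j in selected), default=0.0
        )
        return lam * relevance[i] - (1.0 - lam) * redundancy

    while len(selected) < k and candidates:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to top-k relevance; lowering `lam` penalizes frames that look like ones already selected.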
185. 【2603.00979】Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation
Link: https://arxiv.org/abs/2603.00979
Authors: Jiaqi Tang, Mengyan Zheng, Shu Zhang, Fandong Zhang, Qingchao Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Transformers, require massive annotated, massive annotated datasets, require massive, massive annotated
Notes:
Abstract:Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL's infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.
186. 【2603.00978】EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization
Link: https://arxiv.org/abs/2603.00978
Authors: Zhaoxin Fan, Nanxiang Jiang, Daiheng Gao, Shiji Zhou, Wenjun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Removing undesired concepts, Removing undesired, OpenSora employ flow-matching, long-horizon video generation, Stable Diffusion
Notes:
Abstract:Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
187. 【2603.00976】PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
Link: https://arxiv.org/abs/2603.00976
Authors: Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo, Chenyang Zhu, Xiu Li, Xiangyu Yue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High computational costs, video generation models, High computational, slow inference hinder, computational costs
Notes: ICLR 2026
Abstract:High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose PreciseCache, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, such as achieving an average of $2.6\times$ speedup on Wan2.1-14B without noticeable quality loss.
188. 【2603.00952】Decoupling Motion and Geometry in 4D Gaussian Splatting
Link: https://arxiv.org/abs/2603.00952
Authors: Yi Zhang, Yulei Kang, Jian-Fang Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High-fidelity reconstruction, Gaussian Splatting, Gaussian motion, challenging problem, Gaussian
Notes:
Abstract:High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
189. 【2603.00951】When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning
Link: https://arxiv.org/abs/2603.00951
Authors: Joshua Steier
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: trains Vision Transformers, learning trains Vision, Vision Transformers layer, Vision Transformers, supervised contrastive objectives
Notes: 17 pages, 2 figures, 15 tables, including appendices
Abstract:Contrastive Forward-Forward (CFF) learning trains Vision Transformers layer by layer against supervised contrastive objectives. CFF training can be sensitive to random seed, but the sources of this instability are poorly understood. We focus on one implementation detail: the positive-pair margin in the contrastive loss is applied through saturating similarity clamping, $\min(s + m,\, 1)$. We prove that an alternative formulation, subtracting the margin after the log-probability, is gradient-neutral under the mean-over-positives reduction. On CIFAR-10 ($2 \times 2$ factorial, $n{=}7$ seeds per cell), clamping produces $5.90\times$ higher pooled test-accuracy variance ($p{=}0.003$) with no difference in mean accuracy. Analyses of clamp activation rates, layerwise gradient norms, and a reduced-margin probe point to saturation-driven gradient truncation at early layers. The effect does not transfer cleanly to other datasets: on CIFAR-100, SVHN, and Fashion-MNIST, clamping produces equal or lower variance. Two factors account for the discrepancy. First, positive-pair density per batch controls how often saturation occurs. Second, task difficulty compresses seed-to-seed spread when accuracy is high. An SVHN difficulty sweep confirms the interaction on a single dataset, with the variance ratio moving from $0.25\times$ at high accuracy to $16.73\times$ under aggressive augmentation. In moderate-accuracy regimes with many same-class pairs per batch, switching to the gradient-neutral subtraction reference removes this variance inflation at no cost to mean accuracy. Measuring the layer-0 clamp activation rate serves as a simple check for whether the problem applies.
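The saturating clamp $\min(s + m,\, 1)$ mentioned above has zero gradient with respect to $s$ once $s + m \ge 1$, which is what the abstract's "gradient truncation" refers to. A minimal sketch of the clamp, its derivative, and an activation-rate check follows; the function names are hypothetical, and the exact way the paper measures the layer-0 clamp activation rate is not given in the abstract, so the fraction-of-saturated-pairs definition here is an assumption.

```python
def clamp_margin(s, m):
    """Apply the positive-pair margin through saturating clamping: min(s + m, 1)."""
    return min(s + m, 1.0)


def clamp_margin_grad(s, m):
    """Derivative of min(s + m, 1) with respect to s.

    The gradient passes through unchanged (1.0) while the clamp is inactive
    and is cut to 0.0 once the similarity plus margin saturates.
    """
    return 1.0 if s + m < 1.0 else 0.0


def clamp_activation_rate(similarities, m):
    """Fraction of positive-pair similarities for which the clamp is active.

    Assumption: 'activation rate' is taken to mean the share of pairs
    with s + m >= 1 in a batch.
    """
    if not similarities:
        return 0.0
    active = sum(1 for s in similarities if s + m >= 1.0)
    return active / len(similarities)
```

A rate near zero suggests the clamp rarely truncates gradients for that batch composition; a high rate means many positive pairs contribute no gradient through this term.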
190. 【2603.00949】StegoNGP: 3D Cryptographic Steganography using Instant-NGP
Link: https://arxiv.org/abs/2603.00949
Authors: Wenxiang Jiang, Yujun Lan, Shuo Zhao, Yuanshan Liu, Mingzhu Zhou, Jinxin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Instant Neural Graphics, Neural Graphics Primitives, Graphics Primitives, achieved significant success, securely embedding high-capacity
Notes:
Abstract:Recently, Instant Neural Graphics Primitives (Instant-NGP) has achieved significant success in rapid 3D scene reconstruction, but securely embedding high-capacity hidden data, such as an entire 3D scene, remains a challenge. Existing methods rely on external decoders, require architectural modifications, and suffer from limited capacity, which makes them easily detectable. We propose a novel parameter-free 3D Cryptographic Steganography using Instant-NGP (StegoNGP), which leverages the Instant-NGP hash encoding function as a key-controlled scene switcher. By associating a default key with a cover scene and a secret key with a hidden scene, our method trains a single model to interweave both representations within the same network weights. The resulting model is indistinguishable from a standard Instant-NGP in architecture and parameter count. We also introduce an enhanced Multi-Key scheme, which assigns multiple independent keys across hash levels, dramatically expanding the key space and providing high robustness against partial key disclosure attacks. Experimental results demonstrated that StegoNGP can hide a complete high-quality 3D scene with strong imperceptibility and security, providing a new paradigm for high-capacity, undetectable information hiding in neural fields. The code can be found at this https URL.
191. 【2603.00947】Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Link: https://arxiv.org/abs/2603.00947
Authors: Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: impressive visual fidelity, raising privacy concerns, recently achieved impressive, achieved impressive visual, existing systems require
Notes: The project page is available at: [this https URL](https://zhenchenwan.github.io/Mobile-VTON/)
Abstract:Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
192. 【2603.00938】Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Link: https://arxiv.org/abs/2603.00938
Authors: Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: High Dynamic Range, Standard Dynamic Range, Dynamic Range, High Dynamic, Standard Dynamic
Notes:
Abstract:High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR has a higher bit depth, wide color gamut, and elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR-SDR contrastive KL that encourages token reliance on HDR inputs and a Gaussian weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
193. 【2603.00931】Learning to Weigh Waste: A Physics-Informed Multimodal Fusion Framework and Large-Scale Dataset for Commercial and Industrial Applications
Link: https://arxiv.org/abs/2603.00931
Authors: Md. Adnanul Islam, Wasimul Karim, Md Mahbub Alam, Subhey Sadi Rahman, Md. Abdur Rahman, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Kheng Cher Yeo, Deepika Mathur, Sami Azam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate weight estimation, image-based estimation remains, estimation remains difficult, Multimodal Weight Predictor, Accurate weight
Notes:
Abstract:Accurate weight estimation of commercial and industrial waste is important for efficient operations, yet image-based estimation remains difficult because similar-looking objects may have different densities, and the visible size changes with camera distance. Addressing this problem, we propose the Multimodal Weight Predictor (MWP) framework, which estimates waste weight by combining RGB images with physics-informed metadata, including object dimensions, camera distance, and camera height. We also introduce Waste-Weight-10K, a real-world dataset containing 10,421 synchronized image-metadata pairs collected from logistics and recycling sites. The dataset covers 11 waste categories and a wide weight range from 3.5 to 3,450 kg. Our model uses a Vision Transformer for visual features and a dedicated metadata encoder for geometric and category information, combining them with Stacked Mutual Attention Fusion that allows visual and physical cues to guide each other. This helps the model manage perspective effects and link objects to material properties. To ensure stable performance across the wide weight range, we train the model using Mean Squared Logarithmic Error. On the test set, the proposed method achieves 88.06 kg Mean Absolute Error (MAE), 6.39% Mean Absolute Percentage Error (MAPE), and an $R^2$ coefficient of 0.9548. The model shows strong accuracy for light objects in the 0-100 kg range with 2.38 kg MAE and 3.1% MAPE, while maintaining reliable performance for heavy waste in the 1000-2000 kg range with 11.1% MAPE. Finally, we incorporate a physically grounded explanation module using Shapley Additive Explanations (SHAP) and a large language model to provide clear, human-readable explanations for each prediction.
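The abstract names Mean Squared Logarithmic Error as the training loss for the wide weight range. A minimal sketch follows, assuming the common `log(1 + x)` form of MSLE; the abstract does not state which variant the authors use, and the helper name `msle` is hypothetical. Because the loss compares logarithms, it penalizes relative rather than absolute error, so a heavy item mispredicted by a factor of two costs about as much as a light one mispredicted by the same factor.

```python
import math


def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error using log(1 + x).

    Assumption: this is the common textbook definition, not necessarily
    the exact variant used in the paper. Inputs must be non-negative.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = math.log1p(t) - math.log1p(p)
        total += diff * diff
    return total / len(y_true)


if __name__ == "__main__":
    # Both predictions are off by the same ratio, so the losses match
    # even though the absolute errors are 10 kg and 1000 kg.
    print(msle([9.0], [19.0]), msle([999.0], [1999.0]))
```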
194. 【2603.00925】The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Link: https://arxiv.org/abs/2603.00925
Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: Effective mathematics education, Effective mathematics, identifying and responding, Effective, students' mistakes
Notes: 15 pages, 10 figures
Abstract:Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
195. 【2603.00919】DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Link: https://arxiv.org/abs/2603.00919
Authors: Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: shown great promise, Large language models, autonomous driving systems, Large language, shown great
Notes: The project page is available at [this https URL](https://shiftwilliam.github.io/DriveCode)
Abstract:Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
196. 【2603.00918】Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Link: https://arxiv.org/abs/2603.00918
Authors: Seungwook Kim, Minsu Cho
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: powers content creation, generation powers content, creation across design, data augmentation, powers content
Notes: 19 pages, accepted to CVPR 2026. Project page [this https URL](https://wookiekim.github.io/ARC/)
Abstract:Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
197. 【2603.00912】VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Link: https://arxiv.org/abs/2603.00912
Authors: Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current multi-view indoor, precisely calibrated multi-view, calibrated multi-view camera, fuse multi-view information, global scene representation
Notes: Accepted by CVPR 2026. Code Page: [this https URL](https://github.com/yangcaoai/VGGT-Det-CVPR2026)
Abstract:Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
198. 【2603.00911】On the Exact Algorithmic Extraction of Finite Tesselations Through Prime Extraction of Minimal Representative Forms
Link: https://arxiv.org/abs/2603.00911
Authors: Sushish Baral, Paulo Garcia, Warisa Sritriratanarak
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse computational domains, synthesis and structural, structural optimization, optimization across diverse, diverse computational
Notes:
Abstract:The identification of repeating patterns in discrete grids is rudimentary within symbolic reasoning, algorithm synthesis and structural optimization across diverse computational domains. Although statistical approaches targeting noisy data can approximately recognize patterns, symbolic analysis utilizing deterministic extraction of periodic structures is underdeveloped. This paper aims to fill this gap by employing a hierarchical algorithm that discovers exact tessellations in finite planar grids, addressing the problem where multiple independent patterns may coexist within a hierarchical structure. The proposed method utilizes composite discovery (dual inspection and breadth-first pruning) for identifying rectangular regions with internal repetition, normalization to a minimal representative form, and prime extraction (selective duplication and hierarchical memoization) to account for irregular dimensions and to achieve efficient computation time. We evaluate scalability on grid sizes from 2x2 to 32x32, showing overlap detection on simple repeating tiles exhibits processing time under 1ms, while complex patterns which require exhaustive search and systematic exploration shows exponential growth. This algorithm provides deterministic behavior for exact, axis-aligned, rectangular tessellations, addressing a critical gap in symbolic grid analysis techniques, applicable to puzzle solving reasoning tasks and identification of exact repeating structures in discrete symbolic domains.
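To make "normalization to a minimal representative form" concrete for the simplest case, here is a minimal brute-force sketch that finds the smallest axis-aligned tile whose exact repetition reproduces a whole rectangular grid. It is not the paper's hierarchical algorithm: it ignores nested or coexisting patterns and irregular dimensions, and the name `minimal_tile` is hypothetical.

```python
def minimal_tile(grid):
    """Return the smallest top-left tile that exactly tiles the whole grid.

    grid is a non-empty list of equal-length rows. Only tile sizes that
    divide the grid dimensions are tried, so partial repetitions at the
    border are not handled (a simplification of the problem in the abstract).
    """
    rows, cols = len(grid), len(grid[0])
    for h in range(1, rows + 1):
        if rows % h:
            continue
        for w in range(1, cols + 1):
            if cols % w:
                continue
            # The grid is an exact tessellation of an h x w tile when every
            # cell equals the tile cell at the wrapped-around position.
            if all(
                grid[r][c] == grid[r % h][c % w]
                for r in range(rows)
                for c in range(cols)
            ):
                return [row[:w] for row in grid[:h]]
    return [row[:] for row in grid]  # not reached: h=rows, w=cols always matches
```

Each candidate check costs one pass over the grid, so this sketch is only meant for small grids such as the 2x2 to 32x32 range mentioned in the abstract.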
199. 【2603.00908】UD-SfPNet: An Underwater Descattering Shape-from-Polarization Network for 3D Normal Reconstruction
Link: https://arxiv.org/abs/2603.00908
Authors: Puyun Wang, Kaimin Yu, Huayang He, Feng Huang, Xianyu Wu, Yating Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unique dual advantages, polarization imaging offers, severely hindered, offers the unique, unique dual
Notes:
Abstract:Underwater optical imaging is severely hindered by scattering, but polarization imaging offers the unique dual advantages of descattering and shape-from-polarization (SfP) 3D reconstruction. To exploit these advantages, this paper proposes UD-SfPNet, an underwater descattering shape-from-polarization network that leverages polarization cues for improved 3D surface normal prediction. The framework jointly models polarization-based image descattering and SfP normal estimation in a unified pipeline, avoiding error accumulation from sequential processing and enabling global optimization across both tasks. UD-SfPNet further incorporates a novel color embedding module to enhance geometric consistency by exploiting the relationship between color encodings and surface orientation. A detail enhancement convolution module is also included to better preserve high-frequency geometric details that are lost under scattering. Experiments on the MuS-Polar3D dataset show that the proposed method significantly improves reconstruction accuracy, achieving a mean surface normal angular error of 15.12$^\circ$ (the lowest among compared methods). These results confirm the efficacy of combining descattering with polarization-based shape inference, and highlight the practical significance and potential applications of UD-SfPNet for optical 3D imaging in challenging underwater environments. The code is available at this https URL.
200. 【2603.00906】ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Link: https://arxiv.org/abs/2603.00906
Authors: Xiaolong Zeng, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Look-Up Table based, Table based methods, Look-Up Table, Table based, image restoration tasks
Notes: Accepted to CVPR 2026
Abstract:Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8$\times$ larger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.
201. 【2603.00905】pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
Link: https://arxiv.org/abs/2603.00905
Authors: Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Language Models
Notes: Accepted at ICLR 2026. Project page: [this https URL](https://pySpatial.github.io)
Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
202. 【2603.00887】VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
Link: https://arxiv.org/abs/2603.00887
Authors: Longmi Gao,Pan Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Volume Electron Microscopy, Volume Electron, Electron Microscopy, poor axial resolution, produces anisotropic data
Comments:
Abstract:Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel 3D Dependency Reordering paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves highly competitive performance across various metrics while maintaining a lower computational footprint. The source code is available on GitHub: this https URL
203. 【2603.00881】Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
Link: https://arxiv.org/abs/2603.00881
Authors: Yu Luo,Guangyu Wei,Yangfan Li,Jieyu He,Yueming Lyu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: coronary artery diseases, main coronary artery, coronary artery, X-ray coronary angiography, artery diseases
Comments: 10 pages, 3 figures
Abstract:Segmentation of the main coronary artery from X-ray coronary angiography (XCA) sequences is crucial for the diagnosis of coronary artery diseases. However, this task is challenging due to issues such as blurred boundaries, inconsistent radiation contrast, complex motion patterns, and a lack of annotated images for training. Although Semi-Supervised Learning (SSL) can alleviate the annotation burden, conventional methods struggle with complicated temporal dynamics and unreliable uncertainty quantification. To address these challenges, we propose SAM3-based Teacher-student framework with Motion-Aware consistency and Progressive Confidence Regularization (SMART), a semi-supervised vessel segmentation approach for X-ray angiography videos. First, our method utilizes SAM3's unique promptable concept segmentation design and innovates a SAM3-based teacher-student framework to maximize the performance potential of both the teacher and the student. Second, we enhance segmentation by integrating the vessel mask warping technique and motion consistency loss to model complex vessel dynamics. To address the issue of unreliable teacher predictions caused by blurred boundaries and minimal contrast, we further propose a progressive confidence-aware consistency regularization to mitigate the risk of unreliable outputs. Extensive experiments on three datasets of XCA sequences from different institutions demonstrate that SMART achieves state-of-the-art performance while requiring significantly fewer annotations, making it particularly valuable for real-world clinical applications where labeled data is scarce. Our code is available at: this https URL.
204. 【2603.00878】MMTA: Multi Membership Temporal Attention for Fine-Grained Stroke Rehabilitation Assessment
Link: https://arxiv.org/abs/2603.00878
Authors: Halil Ismail Helvaci,Justin Huber,Jihye Bae,Sen-ching Samson Cheung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: daily activities requires, activities requires temporally, requires temporally precise, temporally precise segmentation, iterative assessments involved
Comments:
Abstract:To empower the iterative assessments involved during a person's rehabilitation, automated assessment of a person's abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
205. 【2603.00870】PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
Link: https://arxiv.org/abs/2603.00870
Authors: Jie Li,Shengwei Tian,Long Yu,Xin Ning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Existing point cloud, Existing point, point cloud completion, Principal Component Analysis, cloud completion
Comments: Submitted to IEEE TPAMI
Abstract:Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid Mamba-Transformer architecture. Our approach introduces an innovative parallel completion strategy guided by Principal Component Analysis (PCA), which imposes a geometrically meaningful structure on unordered point clouds, transforming them into ordered sets and decomposing them into multiple subsets. These subsets are reconstructed in parallel using a multi-head reconstructor. This structured parallel synthesis paradigm significantly enhances the uniformity of point distribution and detail fidelity, while preserving computational efficiency. By integrating Mamba's linear complexity for efficient feature extraction during encoding with the Transformer's capability to model fine-grained multi-sequence relationships during decoding, PPC-MT effectively balances efficiency and reconstruction accuracy. Extensive quantitative and qualitative experiments on benchmark datasets, including PCN, ShapeNet-55/34, and KITTI, demonstrate that PPC-MT outperforms state-of-the-art methods across multiple metrics, validating the efficacy of our proposed framework.
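The PCA-guided ordering described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `principal_axis` and `pca_ordered_subsets` are hypothetical, the principal axis is estimated with plain power iteration, and splitting the ordered points into contiguous chunks is an assumption (the abstract does not say how the subsets are formed).

```python
import math


def principal_axis(points, iters=100):
    """Return (mean, unit vector) of the first principal component via power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    s = 1.0 / math.sqrt(3.0)
    v = [s, s, s]  # arbitrary start; assumed not orthogonal to the main axis
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate cloud (all points identical)
            break
        v = [x / norm for x in w]
    return mean, v


def pca_ordered_subsets(points, num_subsets):
    """Sort points along the principal axis, then split into contiguous chunks.

    Assumption: contiguous chunking; yields at most num_subsets chunks.
    """
    mean, axis = principal_axis(points)

    def score(p):
        # Signed projection of the centered point onto the principal axis.
        return sum((p[i] - mean[i]) * axis[i] for i in range(3))

    ordered = sorted(points, key=score)
    size = -(-len(ordered) // num_subsets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Because the sign of a principal axis is arbitrary, the resulting order may run in either direction along the axis. Each chunk could then be passed to a separate reconstruction head, which is the parallelism the abstract refers to.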
206. 【2603.00853】Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration and Enhancement
Link: https://arxiv.org/abs/2603.00853
Authors: Cong Wang,Jinshan Pan,Liyan Wang,Wei Wang,Yang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: neural discrimination-prompted Transformer, Neural Discrimination-Prompted Network, discrimination-prompted Transformer, neural discrimination-prompted, Neural Discrimination Priors
Comments: Accepted by IJCV'26; code is available at [this https URL](https://github.com/supersupercong/uhdpromer)
Abstract:We propose UHDPromer, a simple yet effective neural discrimination-prompted Transformer for Ultra-High-Definition (UHD) image restoration and enhancement. UHDPromer is inspired by the observation that neural differences implicitly exist between high-resolution and low-resolution features, and that exploring such differences can facilitate low-resolution feature representation. To this end, we first introduce Neural Discrimination Priors (NDP) to measure the differences and then integrate NDP into the proposed Neural Discrimination-Prompted Attention (NDPA) and Neural Discrimination-Prompted Network (NDPN). The proposed NDPA re-formulates the attention by incorporating NDP to globally perceive useful discrimination information, while the NDPN explores a continuous gating mechanism guided by NDP to selectively permit the passage of beneficial content. To enhance the quality of restored images, we propose a super-resolution-guided reconstruction approach, which is guided by super-resolving low-resolution features to facilitate final UHD image restoration. Experiments show that UHDPromer achieves the best computational efficiency while still maintaining state-of-the-art performance on three UHD image restoration and enhancement tasks: low-light image enhancement, image dehazing, and image deblurring. The source codes and pre-trained models will be made available at this https URL.
207. 【2603.00828】MME: Mixture of Mesh Experts with Random Walk Transformer Gating
Link: https://arxiv.org/abs/2603.00828
Authors: Amir Belder,Ayellet Tal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: offering distinct advantages, recent years, offering distinct, distinct advantages, object classes
Comments:
Abstract:In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialise in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism that enables the gate to focus on the areas most informative for each expert's decision-making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts a trade-off between diversity and similarity losses throughout the training, where diversity prompts expert specialization, and similarity enables knowledge sharing among the experts. Our framework achieves state-of-the-art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: this https URL.
208. 【2603.00825】COMBAT: Conditional World Models for Behavioral Agent Training
Link: https://arxiv.org/abs/2603.00825
Authors: Anmol Agarwal,Pranay Meshram,Sumer Singh,Saurav Suman,Andrew Lapp,Shahbuland Matiana,Louis Castricato,Spencer Frazier
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, capable of simulating, environments and interactions, static objects, advances in video
Comments:
Abstract:Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.
209. 【2603.00805】NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
Link: https://arxiv.org/abs/2603.00805
Authors: Seemandhar Jain,Keshav Gupta,Kunal Gupta,Manmohan Chandraker
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Keywords: neural radiance field, requires significant efforts, research requires significant, radiance field, proliferation of neural
Comments: Accepted to CVPR 2026. Project page: [this https URL](https://seemandhar.github.io/NERFIFY/)
Abstract:The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
210. 【2603.00793】Neural Functional Alignment Space: Brain-Referenced Representation of Artificial Neural Networks
Link: https://arxiv.org/abs/2603.00793
Authors: Ruiyu Yan,Hanqi Jiang,Yi Pan,Xiaobo Li,Tianming Liu,Xi Jiang,Lin Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: equal functional grounds, Functional Alignment Space, Neural Functional Alignment, functional grounds, characterizing artificial neural
Comments:
Abstract:We propose the Neural Functional Alignment Space (NFAS), a brain-referenced representational framework for characterizing artificial neural networks on equal functional grounds. NFAS departs from conventional alignment approaches that rely on layer-wise features or task-specific activations by modeling the intrinsic dynamical evolution of stimulus representations across network depth. Specifically, we model layer-wise embeddings as a depth-wise dynamical trajectory and apply Dynamic Mode Decomposition (DMD) to extract the stable mode. This representation is then projected into a biologically anchored coordinate system defined by distributed neural responses. We also introduce the Signal-to-Noise Consistency Index (SNCI) to quantify cross-model consistency at the modality level. Across 45 pretrained models spanning vision, audio, and language, NFAS reveals structured organization within this brain-referenced space, including modality-specific clustering and cross-modal convergence in integrative cortical systems. Our findings suggest that representation dynamics provide a principled basis for
211. 【2603.00777】DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents
Link: https://arxiv.org/abs/2603.00777
Authors: Zikang Xu,Ruinan Jin,Xiaoxiao Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Chest X-ray agents, Tool-using medical agents, improve chest X-ray, chest X-ray question, X-ray question answering
Comments: 11 pages, 3 figures
Abstract:Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCX (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized odds up to 20.79%, and the lowest fairness-utility tradeoff down to 28.65%, and (ii) intermediate behaviors (tool usage, transition patterns, and reasoning traces) exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: this https URL
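The abstract above reports an equalized-odds figure without defining it. As a rough illustration, the sketch below computes one common form of the equalized-odds gap: the larger of the between-group spread in true-positive rate and the spread in false-positive rate. The function names are hypothetical, labels are assumed to be binary (0/1), and the paper may use a different variant of the metric.

```python
def group_rates(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(y_true, y_pred, groups):
    """Max over {TPR, FPR} of the largest difference between any two subgroups."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, name in enumerate(groups) if name == g]
        tpr, fpr = group_rates([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

A gap of 0 means every subgroup has the same error rates; larger values indicate that the classifier treats at least one subgroup differently on positives or negatives.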
212. 【2603.00763】Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models
Link: https://arxiv.org/abs/2603.00763
Authors: Zhenyu Zhou,Defang Chen,Siwei Lyu,Chun Chen,Can Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved unprecedented success, limited sampling budgets, achieved unprecedented, unprecedented success, results under limited
Comments:
Abstract:Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.
213. 【2603.00756】Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder
Link: https://arxiv.org/abs/2603.00756
Authors: Adam Marcus,Paul Bentley,Daniel Rueckert
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: disability worldwide, death and disability, revolutionize stroke care, Stroke, Computed Tomography
Comments: Accepted in The 6th International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2023)
Abstract:Stroke is a major cause of death and disability worldwide. Accurate outcome and evolution prediction has the potential to revolutionize stroke care by individualizing clinical decision-making leading to better outcomes. However, despite a plethora of attempts and the rich data provided by neuroimaging, modelling the ultimate fate of brain tissue remains a challenging task. In this work, we apply recent ideas in the field of diffusion probabilistic models to generate a self-supervised semantically meaningful stroke representation from Computed Tomography (CT) images. We then improve this representation by extending the method to accommodate longitudinal images and the time from stroke onset. The effectiveness of our approach is evaluated on a dataset consisting of 5,824 CT images from 3,573 patients across two medical centers with minimal labels. Comparative experiments show that our method achieves the best performance for predicting next-day severity and functional outcome at discharge.
214. 【2603.00755】BornoViT: A Novel Efficient Vision Transformer for Bengali Handwritten Basic Characters Classification
Link: https://arxiv.org/abs/2603.00755
Authors: Rafi Hassan Chowdhury,Naimul Haque,Kaniz Fatiha
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: significant challenge due, Handwritten character classification, Bengali handwritten character, Bengali handwritten, significant challenge
Comments:
Abstract:Handwritten character classification in the Bengali script is a significant challenge due to the complexity and variability of the characters. The models commonly used for classification are often computationally expensive and data-hungry, making them unsuitable for resource-limited languages such as Bengali. In this experiment, we propose a novel, efficient, and lightweight Vision Transformer model that effectively classifies Bengali handwritten basic characters and digits, addressing several shortcomings of traditional methods. The proposed solution utilizes a deep convolutional neural network (DCNN) in a more simplified manner compared to traditional DCNN architectures, with the aim of reducing computational burden. With only 0.65 million parameters, a model size of 0.62 MB, and 0.16 GFLOPs, our model, BornoViT, is significantly lighter than current state-of-the-art models, making it more suitable for resource-limited environments, which is essential for Bengali handwritten character classification. BornoViT was evaluated on the BanglaLekha Isolated dataset, achieving an accuracy of 95.77%, and demonstrating superior efficiency compared to existing state-of-the-art approaches. Furthermore, the model was evaluated on our self-collected dataset, Bornomala, consisting of approximately 222 samples from different age groups, where it achieved an accuracy of 91.51%.
215. 【2603.00732】UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
Link: https://arxiv.org/abs/2603.00732
Authors: Zhenhao Zhang,Jiaxin Liu,Ye Shi,Jingya Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Planning physically feasible, Planning physically, dexterous hand manipulation, central challenge, challenge in robotic
Comments: Accepted by ICLR 2026
Abstract:Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is available at this https URL.
216. 【2603.00717】Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Link: https://arxiv.org/abs/2603.00717
Authors: Qinghui He,Haifeng Zhang,Qiao Qin,Bo Liu,Xiuli Bi,Bin Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generated image detection, visual forensics, rapid advancement, important task, task in visual
Comments: Accepted to CVPR 2026
Abstract:With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at this https URL.
217. 【2603.00714】A Reconstruction System for Industrial Pipeline Inner Walls Using Panoramic Image Stitching with Endoscopic Imaging
Link: https://arxiv.org/abs/2603.00714
Authors: Rui Ma,Yifeng Wang,Ziteng Yang,Xinghui Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: industrial inspection scenarios, pipeline inner walls, walls remain challenging, inspection scenarios, pipeline
Comments: 4 pages, 1 figure
Abstract:Visual analysis and reconstruction of pipeline inner walls remain challenging in industrial inspection scenarios. This paper presents a dedicated reconstruction system for pipeline inner walls via industrial endoscopes, which is built on panoramic image stitching technology. Equipped with a custom graphical user interface (GUI), the system extracts key frames from endoscope video footage, and integrates polar coordinate transformation with image stitching techniques to unwrap annular video frames of pipeline inner walls into planar panoramic images. Experimental results demonstrate that the proposed method enables efficient processing of industrial endoscope videos, and the generated panoramic stitched images preserve all detailed features of pipeline inner walls in their entirety. This provides intuitive and accurate visual support for defect detection and condition assessment of pipeline inner walls. In comparison with the traditional frame-by-frame video review method, the proposed approach significantly elevates the efficiency of pipeline inner wall reconstruction and exhibits considerable engineering application value.
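The polar coordinate transformation mentioned in the abstract above, which unwraps an annular endoscope frame into a planar strip, can be sketched as follows. This is a simplified illustration rather than the paper's system: `unwrap_annulus` is a hypothetical name, the image is a plain 2D list of grey values, sampling is nearest-neighbour, and the pipe centre and radii are assumed to be known. Key-frame extraction and stitching are not shown.

```python
import math


def unwrap_annulus(image, center, r_inner, r_outer, out_width):
    """Map the ring r_inner <= r < r_outer around `center` to a rectangle.

    Output rows correspond to radii, output columns to angles in [0, 2*pi).
    """
    cx, cy = center
    height, width = len(image), len(image[0])
    out = []
    for r in range(r_inner, r_outer):
        row = []
        for col in range(out_width):
            theta = 2.0 * math.pi * col / out_width
            # Nearest-neighbour lookup of the source pixel on the circle.
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            if 0 <= x < width and 0 <= y < height:
                row.append(image[y][x])
            else:
                row.append(0)  # outside the frame: fill with black
        out.append(row)
    return out
```

In practice a library routine with interpolation would usually replace the inner loops; the sketch only shows the coordinate mapping that turns circles around the pipe axis into straight rows.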
218. 【2603.00711】IU: Imperceptible Universal Backdoor Attack
Link: https://arxiv.org/abs/2603.00711
Authors: Hsin Lin,Yan-Lun Chen,Ren-Hung Hwang,Chia-Mu Yu
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: deep neural networks, visually salient patterns, Backdoor attacks pose, salient patterns, making them easier
Comments:
Abstract:Backdoor attacks pose a critical threat to the security of deep neural networks, yet existing efforts on universal backdoors often rely on visually salient patterns, making them easier to detect and less practical at scale. In this work, we introduce a novel imperceptible universal backdoor attack that simultaneously controls all target classes with minimal poisoning while preserving stealth. Our key idea is to leverage graph convolutional networks (GCNs) to model inter-class relationships and generate class-specific perturbations that are both effective and visually invisible. The proposed framework optimizes a dual-objective loss that balances stealthiness (measured by perceptual similarity metrics such as PSNR) and attack success rate (ASR), enabling scalable, multi-target backdoor injection. Extensive experiments on ImageNet-1K with ResNet architectures demonstrate that our method achieves high ASR (up to 91.3%) under poisoning rates as low as 0.16%, while maintaining benign accuracy and evading state-of-the-art defenses. These results highlight the emerging risks of invisible universal backdoors and call for more robust detection and mitigation strategies.
219. 【2603.00707】Towards Khmer Scene Document Layout Detection
Link: https://arxiv.org/abs/2603.00707
Authors: Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing,Masakazu Iwamura,Koichi Kise
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: language remains constrained, Khmer language remains, large multimodal models, Latin scripts, advanced significantly
Comments: 17 pages, 7 figures, 6 tables
Abstract:While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).
220. 【2603.00702】Towards Universal Khmer Text Recognition
Link: https://arxiv.org/abs/2603.00702
Authors: Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura, Koichi Kise
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-resource language characterized, presenting significant challenges, optical character recognition, complex script, low-resource language
Comments: 17 pages, 9 figures, 6 tables
Abstract:Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced because of available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality does not allow cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models results in significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across different modalities often leads to degraded performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to a particular input image modality and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models are accessible via this gated repository (in review).
221. 【2603.00697】TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
Link: https://arxiv.org/abs/2603.00697
Authors: Yihui Li, Chengxin Lv, Zichen Tang, Hongyu Yang, Di Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Token-aligned Gaussian Prediction, unposed multi-view images, Gaussian Prediction module, framework for joint, unposed multi-view
Comments:
Abstract:We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement. Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods. Project page: this https URL.
222. 【2603.00695】STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification
Link: https://arxiv.org/abs/2603.00695
Authors: Xingguo Xu, Zhanyu Liu, Weixiang Zhou, Yuansheng Gao, Junjie Cao, Yuhao Wang, Jixiang Luo, Dell Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: retrieve specific objects, exploit complementary information, Multi-modal object Re-Identification, object Re-Identification, aims to exploit
Comments: Accepted to AAAI 2026
Abstract:Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) Segmentation-Guided Feature Modulation (SFM) module leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) Semantic Token Reallocation (STR) module employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) Cross-Modal Hypergraph Interaction (CHI) module constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
223. 【2603.00687】SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion
Link: https://arxiv.org/abs/2603.00687
Authors: Guoquan Wei, Liu Shi, Shaoyu Wang, Mohan Li, Cunfeng Wei, Qiegen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: affecting disease diagnosis, fundamental challenge affecting, challenge affecting disease, computed tomography, disease diagnosis
Comments:
Abstract:Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at this https URL.
224. 【2603.00682】CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Link: https://arxiv.org/abs/2603.00682
Authors: Yushan Han, Hui Zhang, Qiming Xia, Yi Jin, Yidong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: overcome perception limitations, Collaborative perception empowers, perception empowers autonomous, share complementary information, empowers autonomous agents
Comments: Accepted by CVPR'26
Abstract:Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose CoLC, a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission. Specifically, CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings. The code is available at this https URL.
225. 【2603.00675】Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis
Link: https://arxiv.org/abs/2603.00675
Authors: Youngjin Yoo, Han Liu, Bogdan Georgescu, Yanbo Zhang, Sasa Grbic, Michael Baumgartner, Thomas J. Re, Jyotipriya Das, Poikavila Ullaskrishnan, Eva Eibenberger, Andrei Chekkoury, Uttam K. Bodanapally, Savvas Nicolaou, Pina C. Sanelli, Thomas J. Schroeppel, Yvonne W. Lui, Eli Gibson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: transfer learning capabilities, strong transfer learning, complex multi-label diagnostic, multi-label diagnostic tasks-such, finding detection-remains understudied
Comments:
Abstract:Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings (including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes), our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2-1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.
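As a rough illustration of the idea in this abstract, the sketch below adds several low-rank adapters to a frozen linear layer and mixes them with softmax routing weights computed from the input. It is not the authors' implementation: the linear router, the toy dimensions, and every name (`molre_forward`, `router`, `experts`) are assumptions made for this example.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def molre_forward(x, w_frozen, experts, router):
    """Frozen linear layer plus a softly routed mixture of low-rank updates.

    experts: list of (A, B) pairs, where A is r x d_in and B is d_out x r.
    router:  K x d_in matrix giving one gating score per expert (assumed form).
    """
    gates = softmax(matvec(router, x))
    out = matvec(w_frozen, x)
    for g, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))  # rank-r update B(Ax)
        out = [o + g * d for o, d in zip(out, delta)]
    return out, gates

# Toy setup (hypothetical values): d_in = d_out = 3, two rank-1 experts.
w_frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
experts = [
    ([[1.0, 0.0, 0.0]], [[0.5], [0.0], [0.0]]),
    ([[0.0, 1.0, 0.0]], [[0.0], [0.5], [0.0]]),
]
router = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

y, gates = molre_forward([1.0, 0.0, 0.0], w_frozen, experts, router)
```

Only the adapters and the router would be trained in such a setup, which is why the added parameter count can stay small relative to the frozen backbone.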
226. 【2603.00668】Direct low-field MRI super-resolution using undersampled k-space
Link: https://arxiv.org/abs/2603.00668
Authors: Daniel Tweneboah Anyimadu, Mohammed M. Abdelsamea, Ahmed Karam Eldaly
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: magnetic resonance imaging, Low-field magnetic resonance, limited image quality, magnetic resonance, affordable access
Comments: 4 pages, 4 figures, conference (The IEEE International Symposium on Biomedical Imaging (ISBI))
Abstract:Low-field magnetic resonance imaging (MRI) provides affordable access to diagnostic imaging but suffers from prolonged acquisition and limited image quality. Accelerated imaging can be achieved with k-space undersampling, while super-resolution (SR) and image quality transfer (IQT) methods typically rely on spatial-domain post-processing. In this work, we propose a novel framework for reconstructing high-field MR-like images directly from undersampled low-field k-space. Our approach employs a k-space dual-channel U-Net that processes the real and imaginary components of undersampled k-space to restore missing frequency content. Experiments on low-field brain MRI demonstrate that our k-space-driven image enhancement consistently outperforms the counterpart spatial-domain method. Furthermore, reconstructions from undersampled k-space achieve image quality comparable to full k-space acquisitions. To the best of our knowledge, this is the first work that investigates low-field MRI SR/IQT directly from undersampled k-space.
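To make the input representation in this abstract concrete, here is a minimal sketch of undersampling k-space and splitting it into a real channel and an imaginary channel. It uses a 1D signal and a naive DFT to stay dependency-free; the sampling mask and all function names are assumptions for illustration, and the U-Net itself is not shown.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), enough for a toy example."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def undersample(kspace, mask):
    """Zero out the k-space samples where mask is 0."""
    return [v if keep else 0j for v, keep in zip(kspace, mask)]

def to_two_channels(kspace):
    """Split complex k-space into a real channel and an imaginary channel."""
    return [v.real for v in kspace], [v.imag for v in kspace]

signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]  # toy 1D "image" row
mask = [1, 1, 0, 0, 0, 0, 0, 1]                    # assumed mask: keep only the lowest frequencies
real_ch, imag_ch = to_two_channels(undersample(dft(signal), mask))
```

A network working on this representation would take `real_ch` and `imag_ch` as its two input channels and try to predict the zeroed entries.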
227. 【2603.00667】Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
Link: https://arxiv.org/abs/2603.00667
Authors: Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: domain-specific image encoders, recent years, driven by domain-specific, Computational pathology, advanced rapidly
Comments: 14 pages, 8 figures. Accepted by CVPR'26
Abstract:Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
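The coarse-to-fine selection described above can be sketched as two ranking passes: first over tissue groups, then over patches inside the chosen groups. This is only an illustration under stated assumptions: relevance is a plain dot product with a question embedding, group relevance is the mean patch score, and the function and variable names are hypothetical rather than taken from HistoSelect.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def coarse_to_fine_select(question, groups, top_groups, top_patches):
    """Pick question-relevant tissue groups, then the best patches inside them.

    groups: dict mapping a group name to a list of (patch_id, embedding) pairs.
    """
    def group_score(patches):
        # Assumed group relevance: mean patch similarity to the question.
        return sum(dot(question, emb) for _, emb in patches) / len(patches)

    ranked = sorted(groups, key=lambda g: group_score(groups[g]), reverse=True)
    selected = []
    for name in ranked[:top_groups]:
        best = sorted(groups[name], key=lambda p: dot(question, p[1]), reverse=True)
        selected.extend(pid for pid, _ in best[:top_patches])
    return selected

question = [1.0, 0.0]  # toy question embedding
groups = {
    "tumor": [("t1", [0.9, 0.1]), ("t2", [0.7, 0.3]), ("t3", [0.2, 0.8])],
    "stroma": [("s1", [0.1, 0.9]), ("s2", [0.2, 0.8])],
}
picked = coarse_to_fine_select(question, groups, top_groups=1, top_patches=2)
```

In this toy case two of five patches are passed on, which is the mechanism by which such a selector would cut visual token usage.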
228. 【2603.00655】Stateful Cross-layer Vision Modulation
Link: https://arxiv.org/abs/2603.00655
Authors: Ying Liu, Yudong Han, Kean Shi, Liyuan Pan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent multimodal large, widely adopt multi-layer, Recent multimodal, multimodal large language, adopt multi-layer visual
Comments:
Abstract:Recent multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. However, existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. As a result, fine-grained details from early layers may be progressively suppressed during hierarchical abstraction. Moreover, directly introducing shallow-layer features into the language model often leads to semantic distribution mismatch with the visual feature space that the LLM's cross-attention layers were pretrained on, which typically requires additional adaptation or fine-tuning of the LLM. To address these limitations, we revisit visual representation learning from the perspective of representation evolution control and propose a cross-layer memory-modulated vision framework (SCVM). Specifically, we introduce a recursively updated cross-layer memory state inside the vision encoder to model long-range inter-layer dependencies. We further design a layer-wise feedback modulation mechanism that refreshes token representations at each layer based on the accumulated memory, thereby structurally regulating the representation evolution trajectory. In addition, we incorporate an auxiliary semantic alignment objective that explicitly supervises the final memory state, encouraging progressive compression and reinforcement of task-relevant information. Experimental results on multiple visual question answering and hallucination evaluation benchmarks demonstrate that SCVM achieves consistent performance improvements without expanding visual tokens, introducing additional vision encoders, or modifying or fine-tuning the language model.
229. 【2603.00654】RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception
Link: https://arxiv.org/abs/2603.00654
Authors: Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Huiliang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enhances scene understanding, enhances scene, scene understanding, Collaborative perception, multi-agent information sharing
Comments: 18 pages, 5 figures, 7 tables
Abstract:Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
230. 【2603.00651】Exploring 3D Dataset Pruning
Link: https://arxiv.org/abs/2603.00651
Authors: Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Zhiqiang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remain largely unexplored, data remain largely, images to remove, accelerate training, largely unexplored
Comments: Code: [this https URL](https://github.com/XiaohanZhao123/3D-Dataset-Pruning)
Abstract:Dataset pruning has been widely studied for 2D images to remove redundancy and accelerate training, while pruning methods specific to 3D data remain largely unexplored. In this work, we study dataset pruning for 3D data, where the commonly observed long-tail class distribution makes optimization under the conventional evaluation metrics Overall Accuracy (OA) and Mean Accuracy (mAcc) inherently conflicting, and makes pruning particularly challenging. To address this, we formulate pruning as approximating the full-data expected risk with a weighted subset, which reveals two key errors: coverage error from insufficient representativeness and prior-mismatch bias from inconsistency between subset-induced class weights and target metrics. We propose representation-aware subset selection with per-class retention quotas for long-tail coverage, and prior-invariant teacher supervision using calibrated soft labels and embedding-geometry distillation. The retention quota also serves as a switch to control the OA-mAcc trade-off. Extensive experiments on 3D datasets show that our method can improve both metrics across multiple settings while adapting to different downstream preferences. Our code is available at this https URL.
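One way to picture a per-class retention quota acting as an OA-mAcc switch is the sketch below. The abstract does not give the quota rule, so the interpolation used here (class weight proportional to `n ** alpha`) is purely an assumption, as are the function name and the toy class counts.

```python
def retention_quotas(class_counts, budget, alpha):
    """Split a retention budget across classes.

    alpha = 1.0 keeps classes in proportion to their size (favoring overall
    accuracy); alpha = 0.0 gives every class an equal share (favoring mean
    per-class accuracy). This interpolation rule is an assumption.
    """
    weights = {c: n ** alpha for c, n in class_counts.items()}
    total = sum(weights.values())
    quotas = {}
    for c, n in class_counts.items():
        share = int(budget * weights[c] / total)  # round down
        quotas[c] = min(n, max(1, share))         # keep at least 1, never exceed class size
    return quotas

counts = {"chair": 800, "table": 150, "lamp": 50}  # toy long-tail distribution
proportional = retention_quotas(counts, budget=100, alpha=1.0)
balanced = retention_quotas(counts, budget=100, alpha=0.0)
```

Because shares are rounded down, a few slots of the budget can stay unused; a fuller version would redistribute them.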
231. 【2603.00643】Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered
Link: https://arxiv.org/abs/2603.00643
Authors: Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: quality assessment benchmarks, single-metric image quality, image quality assessment, position paper argues, assessment benchmarks
Comments:
Abstract:This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.
232. 【2603.00624】IDER: IDempotent Experience Replay for Reliable Continual Learning
Link: https://arxiv.org/abs/2603.00624
Authors: Zhanwang Liu, Yuting Li, Haoyuan Gao, Yexin Li, Linghe Kong, Lichao Sun, Weiran Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: forget previously learned, previously learned knowledge, Catastrophic forgetting, tendency of neural, neural networks
Comments: Accepted by ICLR 2026
Abstract:Catastrophic forgetting, the tendency of neural networks to forget previously learned knowledge when learning new tasks, has been a major challenge in continual learning (CL). To tackle this challenge, CL methods have been proposed and shown to reduce forgetting. Furthermore, CL models deployed in mission-critical settings can benefit from uncertainty awareness by calibrating their predictions to reliably assess their confidence. However, existing uncertainty-aware continual learning methods suffer from high computational overhead and incompatibility with mainstream replay methods. To address this, we propose idempotent experience replay (IDER), a novel approach based on the idempotent property, where repeated function applications yield the same output. Specifically, we first adapt the training loss to make the model idempotent on current data streams. In addition, we introduce an idempotence distillation loss: we feed the output of the current model back into the old checkpoint and then minimize the distance between this reprocessed output and the original output of the current model. This yields a simple and effective new baseline for building reliable continual learners, which can be seamlessly integrated with other CL approaches. Extensive experiments on different CL benchmarks demonstrate that IDER consistently improves prediction reliability while simultaneously boosting accuracy and reducing forgetting. Our results suggest the potential of idempotence as a promising principle for deploying efficient and trustworthy continual learning systems in the real world. Code is available at this https URL.
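The idempotence distillation step described in this abstract can be sketched with toy scalar models. The sketch assumes each model accepts a second input through which a prediction can be fed back, with a neutral value of 0 for the first pass; that interface, the squared-error distance, and all names are assumptions for illustration, not details confirmed by the abstract.

```python
def make_model(w_x, w_y):
    """Toy scalar model f(x, y) = w_x * x + w_y * y, where y is a fed-back prediction."""
    return lambda x, y: w_x * x + w_y * y

def idempotence_distillation_loss(current, old, inputs, neutral=0.0):
    """Mean squared gap between the current output and the old checkpoint's reprocessed output."""
    total = 0.0
    for x in inputs:
        first = current(x, neutral)  # current model's prediction with a neutral second input
        second = old(x, first)       # that prediction fed back into the frozen old checkpoint
        total += (second - first) ** 2
    return total / len(inputs)

old = make_model(w_x=1.0, w_y=0.5)
aligned = make_model(w_x=2.0, w_y=0.0)  # 2x is a fixed point of old: x + 0.5 * 2x = 2x
drifted = make_model(w_x=3.0, w_y=0.0)  # old maps 3x to 2.5x, so a gap remains

batch = [1.0, 2.0, -1.0]
loss_aligned = idempotence_distillation_loss(aligned, old, batch)
loss_drifted = idempotence_distillation_loss(drifted, old, batch)
```

The loss is zero when the old checkpoint leaves the current prediction unchanged and grows as the current model drifts away from outputs the old checkpoint considers consistent.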
233. 【2603.00611】Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
Link: https://arxiv.org/abs/2603.00611
Authors: Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Spectral Compressive Imaging, Compressive Imaging, achieved remarkable success, unlocking significant potential, Spectral Compressive
Comments: Accepted by CVPR 2026
Abstract:Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) the encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) the frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. First, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Second, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs. Project page: this https URL
234. 【2603.00609】Linking Modality Isolation in Heterogeneous Collaborative Perception
Link: https://arxiv.org/abs/2603.00609
Authors: Changxing Liu, Zichen Chao, Siheng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Collaborative perception leverages, leverages data exchange, Collaborative perception, Collaborative, multiple agents
Comments: Accepted by CVPR 2026
Abstract:Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations and thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. The key idea is to explicitly identify the representation consistency through a codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both the OPV2V and DAIR-V2X datasets. Code will be released on this https URL.
235. 【2603.00607】IdGlow: Dynamic Identity Modulation for Multi-Subject Generation
Link: https://arxiv.org/abs/2603.00607
Authors: Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: seamlessly harmonizing multiple, harmonizing multiple reference, multiple reference identities, requires seamlessly harmonizing, image generation requires
Comments:
Abstract:Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
236. 【2603.00604】Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation
Link: https://arxiv.org/abs/2603.00604
Authors: Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes, Jakob Gawlikowski, Cédric Léonard, Nassim Ait Ali Braham, June Moh Goo, Zichao Zeng, Zhipeng Liu, Pallavi Jain, Andrea Nascetti, Ronny Hänsch
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High-quality pixel-level annotations, High-quality pixel-level, remote sensing imagery, sensing imagery, remote sensing semantic
Comments:
Abstract:High-quality pixel-level annotations are essential for the semantic segmentation of remote sensing imagery. However, such labels are expensive to obtain and often affected by noise due to the labor-intensive and time-consuming nature of pixel-wise annotation, which makes it challenging for human annotators to label every pixel accurately. Annotation errors can significantly degrade the performance and robustness of modern segmentation models, motivating the need for reliable mechanisms to identify and quantify noisy training samples. This paper introduces a novel Data-Centric benchmark, together with a novel, publicly available dataset and two techniques for identifying, quantifying, and ranking training samples according to their level of label noise in remote sensing semantic segmentation. Such proposed methods leverage complementary strategies based on model uncertainty, prediction consistency, and representation analysis, and consistently outperform established baselines across a range of experimental settings. The outcomes of this work are publicly available at this https URL.
237. 【2603.00595】UNICBench: UNIfied Counting Benchmark for MLLM
Link: https://arxiv.org/abs/2603.00595
Authors: Chenggang Rong, Tao Han, Zhiyuan Zhao, Yaowu Fan, Jia Wan, Song Guo, Yuan Yuan, Junyu Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, multimodal large language, unified counting dataset, language models, large language
Comments: This paper has been accepted by CVPR 2026
Abstract:Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
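To show what deterministic numeric parsing of a free-form model answer might look like, here is a minimal sketch. The rules it applies (take the last number in the response, strip thousands separators, round decimals, score by exact match) are assumptions for this example; the benchmark's actual parsing and matching rules are not stated in the abstract.

```python
import re

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def parse_count(response):
    """Return the last number in a model response as an int, or None if absent."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return round(float(matches[-1].replace(",", "")))

def exact_match(response, truth):
    """Score 1 when the parsed count equals the ground truth, else 0."""
    return int(parse_count(response) == truth)
```

For instance, `parse_count("I see 3 dogs and 12 cats, so 15 animals.")` yields 15 under these rules, because the final number is treated as the answer.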
238. 【2603.00592】LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
链接:https://arxiv.org/abs/2603.00592
作者:Yuchen Hou,Lin Zhao
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:VLA models, VLA, language, VLA models largely, cs.RO
备注: 7 pages, 3 figures. Code and benchmark will be available at [this https URL](https://github.com/YC11Hou/langgap)
点击查看摘要
Abstract:Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in $\pi_{0.5}$. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap -- success rate improves from 0% to 90% with single-task training, and 0% to 28% with multi-task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions -- precisely the long-term value of LangGap.
239. 【2603.00589】AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
链接:https://arxiv.org/abs/2603.00589
作者:Cencen Liu(1),Dongyang Zhang(1 and 2),Wen Yin(1),Jielei Wang(1 and 2),Tianyu Li(1),Ji Guo(1),Wenbo Jiang(1),Guoqing Wang(1),Guoming Lu(1 and 2) ((1) University of Electronic Science and Technology of China, (2) Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:offering stable training, Hierarchical Consistency Constraint, models have recently, offering stable, stable training
备注: Accepted to CVPR 2026 Findings
点击查看摘要
Abstract:Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
240. 【2603.00586】WildActor: Unconstrained Identity-Preserving Video Generation
链接:https://arxiv.org/abs/2603.00586
作者:Qin Guo,Tianyu Yang,Xuanhua He,Fei Shen,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Dan Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires digital actors, maintain strictly consistent, strictly consistent full-body, consistent full-body identities, Production-ready human video
备注:
点击查看摘要
Abstract:Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level consistency, or produce copy-paste artifacts where subjects appear rigid due to pose locking. We present Actor-18M, a large-scale human video dataset designed to capture identity consistency under unconstrained viewpoints and environments. Actor-18M comprises 1.6M videos with 18M corresponding human images, covering both arbitrary views and canonical three-view representations. Leveraging Actor-18M, we propose WildActor, a framework for any-view conditioned human video generation. We introduce an Asymmetric Identity-Preserving Attention mechanism coupled with a Viewpoint-Adaptive Monte Carlo Sampling strategy that iteratively re-weights reference conditions by marginal utility for balanced manifold coverage. Evaluated on the proposed Actor-Bench, WildActor consistently preserves body identity under diverse shot compositions, large viewpoint transitions, and substantial motions, surpassing existing methods in these challenging settings.
241. 【2603.00585】MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation
链接:https://arxiv.org/abs/2603.00585
作者:Rongsheng Wang,Minghao Wu,Hongru Zhou,Zhihan Yu,Zhenyang Cai,Junying Chen,Benyou Wang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:remains largely unexplored, microscopic phenomena remains, phenomena remains largely, Recent advances, Microscale simulation
备注:
点击查看摘要
Abstract:Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation tasks (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail in microscale simulation, showing violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim-10K, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation. MicroVerse can accurately reproduce complex microscale mechanisms. Our work is the first to introduce the concept of Micro-World Simulation and presents a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our work demonstrates the potential of educational microscale simulations of biological mechanisms. Our data and code are publicly available at this https URL.
242. 【2603.00574】Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
链接:https://arxiv.org/abs/2603.00574
作者:Yongbo He,Zirun Guo,Tao Jin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:evolving test-time distributions, Adapting pretrained multi-modal, Adapting pretrained, biased modality, test-time distributions
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
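The abstract describes interdimensional redundancy as strong correlations across feature dimensions. The sketch below is one possible way to turn that description into a score: the mean absolute Pearson correlation over all pairs of feature dimensions. The abstract does not give DASP's actual formula, so the function names `pearson` and `redundancy` and this particular score are assumptions made only for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def redundancy(features: Sequence[Sequence[float]]) -> float:
    """Mean |correlation| over all pairs of feature dimensions.

    `features` holds one row per sample and one column per dimension.
    This score is an illustrative assumption, not the paper's metric.
    """
    dims = list(zip(*features))  # transpose: one tuple per dimension
    pairs = list(combinations(range(len(dims)), 2))
    if not pairs:
        return 0.0
    return sum(abs(pearson(dims[i], dims[j])) for i, j in pairs) / len(pairs)
```

Under this assumed score, the modality whose features score higher would be the candidate "biased" modality in the diagnose-then-mitigate scheme the abstract outlines.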
243. 【2603.00565】MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs
链接:https://arxiv.org/abs/2603.00565
作者:Yilian Liu,Xiaojun Jia,Guoshun Nan,Jiuyang Lyu,Zhican Chen,Tao Guan,Shuyuan Luo,Zhongyi Zhai,Yang Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:Multimodal Large Language, Large Language Models, Large Language, achieved remarkable performance, Multimodal Large
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that can induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, can make MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which only modestly extend reasoning paths and thus achieve limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, in this paper, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual clues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. The proposed MIDAS enforces longer and more structured multi-image chained reasoning, substantially increases the model's reliance on visual cues while delaying the exposure of malicious semantics and significantly reducing the model's security attention, thereby improving the performance of jailbreak against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that the proposed MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs. Our code is available at this https URL.
244. 【2603.00560】Geometry OR Tracker: Universal Geometric Operating Room Tracking
链接:https://arxiv.org/abs/2603.00560
作者:Yihua Shao,Kang Chen,Feng Xue,Siyu Chen,Long Bai,Hongyuan Yu,Hao Tang,Jinlin Wu,Nassir Navab
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:surgeon behavior recognition, supports downstream applications, physically meaningful quantities, tracking supports downstream, Multi-view Metric Geometry
备注:
点击查看摘要
Abstract:In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition, where physically meaningful quantities such as distances and motion statistics must be measured in meters. However, real clinical deployments rarely satisfy the geometric prerequisites for stable multi-view fusion and tracking: camera calibration and RGB-D registration are always unreliable, leading to cross-view geometric inconsistency that produces "ghosting" during fusion and degrades 3D trajectories in a shared OR coordinate frame. To address this, we introduce Geometry OR Tracker, a two-stage pipeline that first rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup with a single global scale via a Multi-view Metric Geometry Rectification module, and then performs Occlusion-Robust 3D Point Tracking directly in the unified OR world frame. On the MM-OR benchmark, improved geometric consistency translates into tracking gains: our rectification front-end reduces cross-view depth disagreement by more than 30$\times$ compared to raw calibration. Ablation studies further demonstrate the relationship between calibration quality and tracking accuracy, showing that improved geometric consistency yields stronger world-frame tracking.
245. 【2603.00550】Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
链接:https://arxiv.org/abs/2603.00550
作者:Yu Wang,Shengjie Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Weakly supervised video, Weakly supervised, involves identifying, supervisory signals, identifying the temporal
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD: dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates an anomaly-connected component mechanism and an intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (e.g., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.
246. 【2603.00546】Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
链接:https://arxiv.org/abs/2603.00546
作者:Zeyu Chen,Huanjin Yao,Ziwang Zhao,Min Yang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
备注:
点击查看摘要
Abstract:Using Multimodal Large Language Models (MLLMs) as judges to achieve precise and consistent evaluations has gradually become an emerging paradigm across various domains. Evaluating the capability and reliability of MLLM-as-a-judge systems is therefore essential for ensuring trustworthy assessment. Existing judge benchmarks categorize samples by task types but fail to capture the fundamental judgment capabilities required for reliable evaluation. In this work, we introduce M-JudgeBench, a ten-dimensional capability-oriented benchmark designed to comprehensively assess the judgment abilities of MLLMs. Our benchmark decomposes evaluation into pairwise Chain-of-Thought (CoT) comparison, length bias avoidance, and process error detection tasks, jointly covering ten fine-grained subtasks. This design enables diagnosis of model reliability across reasoning styles, response lengths, and cross-model variations. Systematic evaluation uncovers the systematic weaknesses in existing MLLM-as-a-judge systems. To address this issue, we further propose Judge-MCTS, a data construction framework generating pairwise reasoning trajectories with various correctness and length. Using Judge-MCTS, we construct an MCTS-augmented dataset and train M-Judger, a series of strong judge models. Extensive experiments demonstrate the superiority of M-Judger on existing judge benchmarks as well as M-JudgeBench. Overall, our work establishes a more principled foundation for evaluating MLLM-as-a-judge through M-JudgeBench and Judge-MCTS framework, paving the way for future research on judge model evaluation and capability-driven judge training.
247. 【2603.00545】Multiple Inputs and Mixed data for Alzheimer's Disease Classification Based on 3D Vision Transformer
链接:https://arxiv.org/abs/2603.00545
作者:Juan A. Castro-Silva,Maria N. Moreno Garcia,Diego H. Peluffo-Ordoñez
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Magnetic Resonance Imaging, Magnetic Resonance, Alzheimer Disease, Resonance Imaging, significant limitations
备注:
点击查看摘要
Abstract:The current methods for diagnosing Alzheimer's Disease using Magnetic Resonance Imaging (MRI) have significant limitations. Many previous studies used 2D Transformers to analyze individual brain slices independently, potentially losing critical 3D contextual information. Region of interest-based models often focus on only a few brain regions despite Alzheimer's affecting multiple areas. Additionally, most classification models rely on a single test, whereas diagnosing Alzheimer's requires a multifaceted approach integrating diverse data sources for a more accurate assessment. This study introduces a novel methodology called the Multiple Inputs and Mixed Data 3D Vision Transformer (MIMD-3DVT). This method processes consecutive slices together to capture the feature dimensions and spatial information, fuses multiple 3D ROI imaging data inputs, and integrates mixed data from demographic factors, cognitive assessments, and brain imaging. The proposed methodology was experimentally evaluated using a combined dataset that included the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarker, and Lifestyle Flagship Study of Ageing (AIBL), and the Open Access Series of Imaging Studies (OASIS). Our MIMD-3DVT, utilizing single or multiple ROIs, achieved an accuracy of 97.14%, outperforming the state-of-the-art methods in distinguishing between Normal Cognition and Alzheimer's Disease.
248. 【2603.00543】Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
链接:https://arxiv.org/abs/2603.00543
作者:Ke Cao,Xuanhua He,Xueheng Li,Lingting Zhu,Yingying Wang,Ao Ma,Zhanjie Zhang,Man Zhou,Chengjun Xie,Jie Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generate high-resolution multi-spectral, aims to generate, detail of panchromatic, spectral richness, high-resolution multi-spectral images
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.
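The abstract says ScaleFormer tokenizes images into patches of a fixed resolution, so that a larger image yields a longer sequence instead of larger tokens. The idea can be sketched with a plain patchify routine. This is only an illustration of fixed-size patch tokenization; the function name `patchify` is hypothetical, and the Scale-Aware Patchify module named in the abstract is not reproduced here.

```python
from typing import List, Sequence

Patch = List[List[float]]


def patchify(image: Sequence[Sequence[float]], patch: int) -> List[Patch]:
    """Split a 2D image into non-overlapping `patch` x `patch` tiles in row-major order.

    The tile size is fixed, so the sequence length grows with the image size.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    sequence: List[Patch] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [list(row[left:left + patch]) for row in image[top:top + patch]]
            sequence.append(tile)
    return sequence
```

With a patch size of 2, a 4x4 image gives 4 tokens and an 8x8 image gives 16, while every token stays 2x2. Generalizing to a larger image then amounts to handling a longer sequence, which is the reframing the abstract describes.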
249. 【2603.00542】Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation
链接:https://arxiv.org/abs/2603.00542
作者:Yafei Zhang,Shuaitian Song,Huafeng Li,Shujuan Wang,Yu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:adaptive dynamic dehazing, closed-loop optimization, task feedback loop, text instruction interface, downstream tasks
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:In real-world vision systems, haze removal is required not only to enhance image visibility but also to meet the specific needs of diverse downstream tasks. To address this challenge, we propose a novel adaptive dynamic dehazing framework that incorporates a closed-loop optimization mechanism. It enables feedback-driven refinement based on downstream task performance and user instruction-guided adjustment during inference, allowing the model to satisfy the specific requirements of multiple downstream tasks without retraining. Specifically, our framework integrates two complementary and innovative mechanisms: (1) a task feedback loop that dynamically modulates dehazing outputs based on performance across multiple downstream tasks, and (2) a text instruction interface that allows users to specify high-level task preferences. This dual-guidance strategy enables the model to adapt its dehazing behavior after training, tailoring outputs in real time to the evolving needs of multiple tasks. Extensive experiments across various vision tasks demonstrate the strong effectiveness, robustness, and generalizability of our approach. These results establish a new paradigm for interactive, task-adaptive dehazing that actively collaborates with downstream applications.
250. 【2603.00535】RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation
链接:https://arxiv.org/abs/2603.00535
作者:Xianhao Zhou,Jianghao Wu,Lanfeng Zhong,Ku Zhao,Jinlong He,Shaoting Zhang,Guotai Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unreliable Hounsfield Unit, Hounsfield Unit, unreliable Hounsfield, dose calculation, routinely acquired
备注:
点击查看摘要
Abstract:Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT--CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at this https URL.
251. 【2603.00529】CaptionFool: Universal Image Captioning Model Attacks
链接:https://arxiv.org/abs/2603.00529
作者:Swapnil Parekh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:large-scale image-text datasets, encoder-decoder architectures trained, image-text datasets, making them susceptible, encoder-decoder architectures
备注:
点击查看摘要
Abstract:Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate "slang" terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
252. 【2603.00527】TP-Spikformer: Token Pruned Spiking Transformer
链接:https://arxiv.org/abs/2603.00527
作者:Wenjie Wei,Xiaolong Zhou,Malu Zhang,Ammar Belatreche,Qian Sun,Yimeng Shan,Dehao Zhang,Zijian Zhou,Zeyu Ma,Yang Yang,Haizhou Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:traditional neural networks, neural networks due, Spiking neural networks, event-driven computing paradigm, neural networks
备注: 24 pages, 7 figures
点击查看摘要
Abstract:Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.
253. 【2603.00526】Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
链接:https://arxiv.org/abs/2603.00526
作者:Zhen Zhou,Jian Liu,Biwen Lei,Jing Xu,Haohan Weng,Yiling Zhu,Zhuo Chen,Junfeng Fan,Yunkai Ma,Dazhao Du,Song Guo,Fengshui Jing,Chunchao Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains largely unexplored, demonstrated remarkable success, Reinforcement learning, generation remains largely, largely unexplored
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO), which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored to post-training for 3D mesh generation, which is 3.75$\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.
254. 【2603.00519】Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness
链接:https://arxiv.org/abs/2603.00519
作者:Yuyang Chen,Linqian Zeng,Yijin ZHou,Hengjie Li,Jidong Zhai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, requiring intensive full-attention, intensive full-attention computation, achieved remarkable success, computational efficiency remains
备注:
点击查看摘要
Abstract:Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at this https URL.
255. 【2603.00518】Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Link: https://arxiv.org/abs/2603.00518
Authors: Quan Kong,Yanru Xiao,Yuhao Shen,Cong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision research, Convolutional Neural Networks, expressive visual representation, traditional Convolutional Neural, efficient and expressive
Comments:
Abstract:Learning efficient and expressive visual representations has long been the pursuit of computer vision research. While Vision Transformers (ViTs) are gradually replacing traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address this challenge, we bring Test-Time Training (TTT), a new linear-time sequence modeling method, into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating a bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification, respectively, and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
256. 【2603.00515】MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence
Link: https://arxiv.org/abs/2603.00515
Authors: Xingyilang Yin,Chengzhengxu Li,Jiahao Chang,Chi-Man Pun,Xiaodong Cun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Humans are born, born with vision-based, space over time, perceive and reason, purely visual inputs
Comments:
Abstract:Humans are born with vision-based 4D spatial-temporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This results in the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes a foundational 4D understanding via SFT and further catalyzes 4D reasoning capabilities by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward) without involving the modification of architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatial-temporal understanding and reasoning capabilities from purely 2D RGB inputs. Project page: this https URL.
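The GRPO stage mentioned in this abstract scores each sampled response relative to the other responses drawn for the same prompt. Below is a minimal sketch of that group-relative advantage, assuming the commonly used mean and standard-deviation normalization. The function name, the `eps` constant, and the example rewards are illustrative assumptions; none of them come from MLLM-4D or its ST-reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one question.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.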
257. 【2603.00512】Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Link: https://arxiv.org/abs/2603.00512
Authors: Wang Chen,Yuhui Zeng,Yongdong Luo,Tianyu Xie,Luojun Lin,Jiayi Ji,Yan Zhang,Xiawu Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, applying Large Vision-Language, limited context windows, Large Vision-Language, high frame redundancy
Comments: Accepted at CVPR 2026
Abstract:Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
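The second stage described in this abstract relies on Maximal Marginal Relevance (MMR) to pick frames that are relevant yet not redundant. The sketch below shows generic greedy MMR on toy data; it is not the WFS-SB implementation. The relevance scores, the feature vectors, the trade-off weight `lam`, and all function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(relevance, features, budget, lam=0.5):
    """Greedy MMR: return up to `budget` frame indices in selection order.

    relevance: query-frame relevance score per frame (assumed precomputed).
    features:  one feature vector per frame, used to measure redundancy.
    lam:       weight on relevance; (1 - lam) weights the redundancy penalty.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < budget:

        def score(i):
            # Redundancy is the highest similarity to any frame already chosen.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Frames 0 and 1 are near-duplicates; frame 2 is less relevant but distinct.
    rel = [0.9, 0.85, 0.3]
    feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
    print(mmr_select(rel, feats, budget=2))
```

With a budget of two, the near-duplicate of the first pick is penalized heavily, so the distinct frame is chosen instead even though its relevance score is lower.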
258. 【2603.00511】Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
Link: https://arxiv.org/abs/2603.00511
Authors: Ruoshuang Du,Xin Sun,Qiang Liu,Bowen Song,Zhongqi Chen,Weiqiang Wang,Liang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Question Answering systems, Visual Question Answering, Answering systems face, Question Answering, systems face reliability
Comments: 8 pages, 6 figures
Abstract:Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments demonstrate that the model achieves a significant improvement in response performance on three VQA datasets. Ablation studies highlight the importance of internal representations in adaptive retrieval decisions. Overall, the experimental results demonstrate that MMA-RAG effectively balances external knowledge utilization and inference robustness in diverse multimodal scenarios.
259. 【2603.00510】What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.00510
Authors: Yingqi Fan,Junlong Tong,Anhao Zhao,Xiaoyu Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal large language, large language models, language models, remain poorly understood, Multimodal large
Comments: Accepted by CVPR 2026
Abstract:Multimodal large language models (MLLMs) project visual tokens into the embedding space of language models, yet the internal structuring and processing of visual semantics remain poorly understood. In this work, we introduce a two-fold analytical framework featuring a novel probing tool, EmbedLens, to conduct a fine-grained analysis. We uncover a pronounced semantic sparsity at the input level: visual tokens consistently partition into sink, dead, and alive categories. Remarkably, only the alive tokens, comprising approximately 60% of the total input, carry image-specific meaning. Furthermore, using a targeted patch-compression benchmark, we demonstrate that these alive tokens already encode rich, fine-grained cues (e.g., objects, colors, and OCR) prior to entering the LLM. Internal visual computations (such as visual attention and feed-forward networks) are redundant for most standard tasks. For the small subset of highly vision-centric tasks that actually benefit from internal processing, we reveal that alive tokens naturally align with intermediate LLM layers rather than the initial embedding space, indicating that shallow-layer processing is unnecessary and that direct mid-layer injection is sufficient. Ultimately, our findings provide a unified mechanistic view of visual token processing, paving the way for more efficient and interpretable MLLM architectures through selective token pruning, minimized visual computation, and mid-layer injection. The code is released at: this https URL.
260. 【2603.00504】Hierarchical Classification for Improved Histopathology Image Analysis
Link: https://arxiv.org/abs/2603.00504
Authors: Keunho Byeon,Jinsol Song,Seong Min Hong,Yosep Chong,Jin Tae Kwak
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Whole-slide image analysis, methods primarily rely, Whole-slide image, ignoring hierarchical relationships, existing deep learning
Comments:
Abstract:Whole-slide image (WSI) analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis that enhances both coarse-grained and fine-grained WSI classification. Built upon a multiple instance learning approach, HiClass extends it by introducing bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior performance on both coarse-grained and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
261. 【2603.00503】M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
Link: https://arxiv.org/abs/2603.00503
Authors: Dawei Yan,Haokui Zhang,Guangda Huzhang,Yang Li,Yibo Wang,Qing-Guo Chen,Zhao Xu,Weihua Luo,Ying Li,Wei Dong,Chunhua Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, autonomous web navigation, Multimodal Large, demonstrated remarkable potential
Comments:
Abstract:Multimodal Large Language Models (MLLMs) based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains up to 12.5% alongside significantly lower computational overhead.
262. 【2603.00493】COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Link: https://arxiv.org/abs/2603.00493
Authors: Yuchen Che,Jingtu Wu,Hao Zheng,Asako Kanezaki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single reference view, due to occlusions, single reference, reference view, view is challenging
Comments: Accepted at CVPR 2026
Abstract:Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse key-points. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.
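The idea of injecting point-wise confidences as optimal transport marginals can be illustrated with plain entropic optimal transport solved by Sinkhorn iterations. This is a toy sketch under stated assumptions, not COG's formulation. The cost matrix, the confidence values, the regularization strength `reg`, and the function names are all hypothetical, and the abstract does not say which OT variant or solver the authors use.

```python
from math import exp


def normalize(confidences):
    """Scale non-negative confidences so they sum to 1 (a valid OT marginal)."""
    total = sum(confidences)
    return [c / total for c in confidences]


def sinkhorn(cost, a, b, reg=0.5, iters=200):
    """Entropic OT between marginals a (rows) and b (columns).

    Returns a soft correspondence matrix whose row sums approximate a and
    whose column sums match b, so low-confidence points carry little mass.
    """
    n, m = len(a), len(b)
    kernel = [[exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Toy matching cost: point i in one view matches point i in the other.
    cost = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
    # Hypothetical confidences: the third point is likely outside the overlap.
    a = normalize([0.9, 0.9, 0.05])
    b = normalize([0.9, 0.9, 0.05])
    for row in sinkhorn(cost, a, b):
        print([round(x, 4) for x in row])
```

Because each marginal caps the total mass a point may send or receive, a low-confidence point contributes only weakly to the soft correspondences, and every step remains differentiable.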
263. 【2603.00492】ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models
Link: https://arxiv.org/abs/2603.00492
Authors: Riccardo de Lutio,Tobias Fischer,Yen-Yu Chang,Yuxuan Zhang,Jay Zhangjie Wu,Xuanchi Ren,Tianchang Shen,Katarina Tothova,Zan Gojcic,Haithem Turki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Keywords: Gaussian Splatting provide, Gaussian Splatting, Per-scene optimization methods, Splatting provide, Per-scene optimization
Comments: Video results: [this https URL](https://artifixer2026.github.io/)
Abstract:Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.
264. 【2603.00486】Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Link: https://arxiv.org/abs/2603.00486
Authors: Qihang Fan,Yuang Ai,Huaibo Huang,Ran He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: research efforts aim, grouping, random grouping, aim to address, quadratic complexity
Comments: Accepted by CVPR 2026
Abstract:Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and then either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? To answer them, we propose a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at this https URL.
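To make the cost argument concrete, the sketch below partitions token indices into random groups and counts how many query-key pairs attention would evaluate within groups versus over all tokens. It only illustrates the general idea and is not the authors' code. The function names and the choice of a seeded generator (one simple way to keep the grouping pattern fixed across calls) are assumptions.

```python
import random


def random_groups(num_tokens, group_size, seed=0):
    """Partition token indices 0..num_tokens-1 into random groups.

    A fixed seed reproduces the same partition on every call, which is one
    hypothetical way to obtain a fixed grouping pattern.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    return [order[i:i + group_size] for i in range(0, num_tokens, group_size)]


def attention_pairs(groups):
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    # A 14x14 patch grid has 196 tokens; split them into groups of 49.
    groups = random_groups(num_tokens=196, group_size=49)
    print("grouped pairs:", attention_pairs(groups))
    print("full attention pairs:", 196 ** 2)
```

Since each random group draws tokens from across the whole image, every group still spans distant positions, which relates to the global receptive field that the abstract lists among the crucial design elements.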
265. 【2603.00483】RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Link: https://arxiv.org/abs/2603.00483
Authors: Liyao Jiang,Ruichen Chen,Chao Gao,Di Niu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: achieve remarkable realism, alignment remains challenging, faithful prompt-image alignment, prompt-image alignment remains, models achieve remarkable
Comments: CVPR 2026
Abstract:Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while requiring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at this https URL.
266. 【2603.00482】TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
Link: https://arxiv.org/abs/2603.00482
Authors: Feibo Jiang,Siwei Tu,Li Dong,Xiaolong Li,Kezhi Wang,Cunhua Pan,Zhu Han,Jiangzhou Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Keywords: Visual-Language Models, offer a solid, strong capabilities, solid foundation, foundation for intelligent
Comments:
Abstract:Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system.
267. 【2603.00481】Analyzing Physical Adversarial Example Threats to Machine Learning in Election Systems
Link: https://arxiv.org/abs/2603.00481
Authors: Khaleque Md Aashiq Kamal,Surya Eada,Aayushi Verma,Subek Acharya,Adrian Yemin,Benjamin Fuller,Kaleel Mahmood
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: adversarial, shown both promising, promising results, machine learning, Developments
Comments: 20 pages, 8 figures, 28 tables
Abstract:Developments in the machine learning voting domain have shown both promising results and risks. Trained models perform well on ballot classification tasks (99% accuracy) but are at risk from adversarial example attacks that cause misclassifications. In this paper, we analyze an attacker who seeks to deploy adversarial examples against machine learning ballot classifiers to compromise a U.S. election. We first derive a probabilistic framework for determining the number of adversarial example ballots that must be printed to flip an election, in terms of the probability of each candidate winning and the total number of ballots cast. Second, it is an open question as to which type of adversarial example is most effective when physically printed in the voting domain. We analyze six different types of adversarial example attacks: l_infinity-APGD, l2-APGD, l1-APGD, l0 PGD, l0 + l_infinity PGD, and l0 + sigma-map PGD. Our experiments include physical realizations of 144,000 adversarial examples through printing and scanning with four different machine learning models. We empirically demonstrate an analysis gap between the physical and digital domains, wherein attacks most effective in the digital domain (l2 and l_infinity) differ from those most effective in the physical domain (l1 and l2, depending on the model). By unifying a probabilistic election framework with digital and physical adversarial example evaluations, we move beyond prior close race analyses to explicitly quantify when and how adversarial ballot manipulation could alter outcomes.
268. 【2603.00479】U-VLM: Hierarchical Vision Language Modeling for Report Generation
Link: https://arxiv.org/abs/2603.00479
Authors: Pengcheng Shi,Minghui Zhang,Kehan Song,Jiaqi Liu,Yun Gu,Xinglin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Automated radiology report, improving diagnostic consistency, imaging remains challenging, medical imaging remains, reducing radiologist workload
Comments:
Abstract:Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at this https URL.
269. 【2603.00478】Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols
Link: https://arxiv.org/abs/2603.00478
Authors: Xu Luo,Ji Zhang,Lianli Gao,Heng Tao Shen,Jingkuan Song
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: lacks a unified, real-world usage, stronger pre-trained models, revolutionized by stronger, improved adaptation
Comments: 13 pages
Abstract:Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation. However, the field still lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic representations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at this https URL.
270. 【2603.00467】High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System
Link: https://arxiv.org/abs/2603.00467
Authors: Pengju Sun,Banglei Guan,Jing Tao,Zhenbao Yu,Xuanyu Bai,Yang Shang,Qifeng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High dynamic range, extreme illumination remains, conventional cameras due, illumination remains challenging, High dynamic
Comments: This paper has been accepted by Optics Express
Abstract:High dynamic range (HDR) imaging under extreme illumination remains challenging for conventional cameras due to overexposure. Event cameras provide microsecond temporal resolution and high dynamic range, while spatially varying exposure (SVE) sensors offer single-shot radiometric [...]. We present a hardware-algorithm co-designed HDR imaging system that tightly integrates an SVE micro-attenuation camera with an event sensor in an asymmetric dual-modality configuration. To handle non-coaxial geometry and heterogeneous optics, we develop a two-stage cross-modal alignment framework that combines feature-guided coarse homography estimation with a multi-scale refinement module based on spatial pooling and frequency-domain filtering. On top of aligned representations, we develop a cross-modal HDR reconstruction network with convolutional fusion, mutual-information regularization, and a learnable fusion loss that adaptively balances intensity cues and event-derived structural constraints. Comprehensive experiments on both synthetic benchmarks and real captures demonstrate that the proposed system consistently improves highlight recovery, edge fidelity, and robustness compared with frame-only or event-only HDR pipelines. The results indicate that jointly optimizing optical design, cross-modal alignment, and computational fusion provides an effective foundation for reliable HDR perception in highly dynamic and radiometrically challenging environments.
271. 【2603.00466】DreamWorld: Unified World Modeling in Video Generation
Link: https://arxiv.org/abs/2603.00466
Authors: Boming Tan,Xiangdong Zhang,Ning Liao,Yuqing Zhang,Shaofeng Zhang,Xue Yang,Qi Fan,Yanyong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing models remain, models remain limited, surface-level plausibility, lacking a coherent, impressive progress
Comments:
Abstract:Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available on GitHub at this https URL.
272. 【2603.00462】OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation
Link: https://arxiv.org/abs/2603.00462
Authors: Zhaolin Yu,Litao Yang,Ben Babicka,Ming Hu,Jing Hao,Anthony Huang,James Huang,Yueming Jin,Jiasong Wu,Zongyuan Ge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: standard panoramic radiograph, multiple diagnostic tasks, Vision Language Models, radiograph in dentistry, standard panoramic
Comments: 10 pages, 2 figures
Abstract:Orthopantomograms (OPGs) are the standard panoramic radiograph in dentistry, used for full-arch screening across multiple diagnostic tasks. While Vision Language Models (VLMs) now allow multi-task OPG analysis through natural language, they underperform task-specific models on most individual tasks. Agentic systems that orchestrate specialized tools offer a path to both versatility and accuracy, but this approach remains unexplored in dental imaging. To address this gap, we propose OPGAgent, a multi-tool agentic system for auditable OPG interpretation. OPGAgent coordinates specialized perception modules with a consensus mechanism through three components: (1) a Hierarchical Evidence Gathering module that decomposes OPG analysis into global, quadrant, and tooth-level phases with dynamically invoked tools, (2) a Specialized Toolbox encapsulating spatial, detection, utility, and expert zoos, and (3) a Consensus Subagent that resolves conflicts through anatomical constraints. We further propose OPG-Bench, a structured-report protocol based on (Location, Field, Value) triples derived from real clinical reports, which enables a comprehensive review of findings and hallucinations, extending beyond the limitations of VQA indicators. On our OPG-Bench and the public MMOral-OPG benchmark, OPGAgent outperforms current dental VLMs and medical agent frameworks across both structured-report and VQA evaluation. Code will be released upon acceptance.
273. 【2603.00461】ReMoT: Reinforcement Learning with Motion Contrast Triplets
Link: https://arxiv.org/abs/2603.00461
Authors: Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unified training paradigm, critical failure point, point in navigation, autonomous driving, Group Relative Policy
Comments: CVPR 2026
Click to view abstract
Abstract:We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
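The abstract names Group Relative Policy Optimization without giving its details. As a rough illustration of the core idea only, the sketch below normalizes each sampled response's reward against the other responses in its group. The reward values, the function name, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one motion-contrast prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and the rest a negative one, so no separately learned value function is needed for this step.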
274. 【2603.00459】Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation
Link: https://arxiv.org/abs/2603.00459
Authors: Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate semantic segmentation, automated wound monitoring, delineation remains challenging, remains challenging due, Accurate semantic
Comments:
Click to view abstract
Abstract:Accurate semantic segmentation of foot ulcers is essential for automated wound monitoring, yet boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. To overcome the limitations of standard intensity-based networks, we present LSS-LTCNet: an ante-hoc explainable framework synergizing deterministic structural priors with continuous-time neural dynamics. Our architecture departs from traditional black-box models by employing a Local Self-Similarity (LSS) mechanism that extracts dense, illumination-invariant texture descriptors to explicitly disentangle necrotic tissue from background artifacts. To enforce topological precision, we introduce a Liquid Time-Constant (LTC) refinement module that treats boundary evolution as an ODE-governed dynamic system, iteratively refining masks over continuous time-steps. Comprehensive evaluation on the MICCAI FUSeg dataset demonstrates that LSS-LTCNet achieves state-of-the-art boundary alignment, securing a peak Dice score of 86.96% and an exceptional 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. Requiring merely 25.70M parameters, the model significantly outperforms heavier U-Net and transformer baselines in efficiency. By providing inherent visual audit trails alongside high-fidelity predictions, LSS-LTCNet offers a robust and transparent solution for computer-aided diagnosis in mobile healthcare (mHealth) settings.
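For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of how Dice and the related IoU are commonly computed for a pair of binary masks. The toy masks and the function name are made up for illustration, and the paper's own evaluation code may aggregate scores differently.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    if total == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    union = total - inter
    return 2 * inter / total, inter / union

# Hypothetical 2x3 masks: 1 marks a wound pixel, 0 marks background
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

Here the masks share two foreground pixels out of three each, so Dice is 2*2/6 and IoU is 2/4.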
275. 【2603.00458】Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
Link: https://arxiv.org/abs/2603.00458
Authors: Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved impressive results, multi-step sampling leads, real-world video super-resolution, slow inference, achieved impressive
Comments:
Click to view abstract
Abstract:While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone, augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces complexity by 95% in parameters and achieves an 8× acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
276. 【2603.00443】SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
Link: https://arxiv.org/abs/2603.00443
Authors: Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthetic training data, Recent studies, hand, demonstrated the effectiveness, effectiveness of synthetic
Comments:
Click to view abstract
Abstract:Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
277. 【2603.00439】Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling
Link: https://arxiv.org/abs/2603.00439
Authors: Xueyang Li, Yunzhong Lou, Yu Song, Xiangdong Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: parametric CAD sequence, parametric CAD sequences, parametric CAD, longer parametric CAD, CAD
Comments: Accepted to AAAI 2025
Click to view abstract
Abstract:Computer-Aided Design (CAD) generative modeling has strong, long-standing applications in industry. Recently, the parametric CAD sequence, as the design logic of an object, has been widely mined by sequence models. However, industrial CAD models, especially component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address this problem, we introduce Mamba-CAD, a self-supervised generative modeling method for complex industrial CAD models that can model longer parametric CAD sequences. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; we then utilize the learned representation to guide a generative adversarial network to produce fake representations of CAD models, which are finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset can be obtained from this https URL.
278. 【2603.00437】Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models
Link: https://arxiv.org/abs/2603.00437
Authors: April Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, made substantial progress, Vision-Language Models, Large Vision-Language, remains a challenge
Comments:
Click to view abstract
Abstract:Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
279. 【2603.00433】TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis
Link: https://arxiv.org/abs/2603.00433
Authors: Hui Wan, Libin Lan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Executing multiple tasks, Executing multiple, Vision Foundation Models, introduces significant challenges, multiple tasks simultaneously
Comments: 4 pages, 2 figures, 4 tables; Submitted to ISBI FMC UIA 2026; Our code is publicly available at [this https URL](https://github.com/huiwanHW/Florence-2-adaptation)
Click to view abstract
Abstract:Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected specific top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF wins fifth place, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.
280. 【2603.00431】Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Link: https://arxiv.org/abs/2603.00431
Authors: Hulingxiao He, Zhi Tan, Yuxin Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: general-purpose visual understanding, map visual inputs, Large Multimodal Models, images exist, visual understanding model
Comments: Published as a conference paper at CVPR 2026
Click to view abstract
Abstract:A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set, for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR), which aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at this https URL.
281. 【2603.00426】LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation
Link: https://arxiv.org/abs/2603.00426
Authors: Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie, Yifan Huang, Lei Wu, Wei Han, Haishuai Wang, Jiajun Bu
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: utilizing Multimodal Large, Multimodal Large Language, frequently encounters challenges, reports utilizing Multimodal, encounters challenges related
Comments: 10 pages, 1 figure
Click to view abstract
Abstract:The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.
282. 【2603.00423】An Interpretable Local Editing Model for Counterfactual Medical Image Generation
Link: https://arxiv.org/abs/2603.00423
Authors: Hyungi Min, Taeseung You, Hangyeul Lee, Yeongjae Cho, Sungzoon Cho
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: enhancing AI-driven systems, domain by answering, Counterfactual medical image, critical tool, tool for enhancing
Comments:
Click to view abstract
Abstract:Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering "what-if" questions. However, existing approaches face two fundamental limitations. First, they fail to prevent unintended modifications, resulting in collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. Through extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.
283. 【2603.00418】Station2Radar: query conditioned gaussian splatting for precipitation field
Link: https://arxiv.org/abs/2603.00418
Authors: Doyi Kim, Minseok Seo, Changick Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Precipitation forecasting relies, heterogeneous data, forecasting relies, relies on heterogeneous, Gaussian Splatting
Comments: This paper was accepted to ICLR 2026
Click to view abstract
Abstract:Precipitation forecasting relies on heterogeneous data. Weather radar is accurate, but coverage is geographically limited and costly to maintain. Weather stations provide accurate but sparse point measurements, while satellites offer dense, high-resolution coverage without direct rainfall retrieval. To overcome these limitations, we propose Query-Conditioned Gaussian Splatting (QCGS), the first framework to fuse automatic weather station (AWS) observations with satellite imagery for generating precipitation fields. Unlike conventional 2D Gaussian splatting, which renders the entire image plane, QCGS selectively renders only queried precipitation regions, avoiding unnecessary computation in non-precipitating areas while preserving sharp precipitation structures. The framework combines a radar point proposal network that identifies rainfall-support locations with an implicit neural representation (INR) network that predicts Gaussian parameters for each point. QCGS enables efficient, resolution-flexible precipitation field generation in real time. Through extensive evaluation with benchmark precipitation products, QCGS demonstrates over 50% improvement in RMSE compared to conventional gridded precipitation products, and consistently maintains high performance across multiple spatiotemporal scales.
284. 【2603.00413】DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
Link: https://arxiv.org/abs/2603.00413
Authors: Changpu Li, Shuang Wu, Songlin Tang, Guangming Lu, Jun Yu, Wenjie Pei
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: challenging task due, Reconstructing transparent objects, transparent objects, Reconstructing transparent, challenging task
Comments:
Click to view abstract
Abstract:Reconstructing transparent objects from a set of multi-view images is a challenging task due to the complicated nature and indeterminate behavior of light propagation. Typical methods are primarily tailored to specific scenarios, such as objects following a uniform topology, exhibiting ideal transparency and surface specular reflections, or with only surface materials, which substantially constrains their practical applicability in real-world settings. In this work, we propose a differentiable rendering framework for transparent objects, dubbed DiffTrans, which allows for efficient decomposition and reconstruction of the geometry and materials of transparent objects, thereby reconstructing transparent objects accurately in intricate scenes with diverse topology and complex texture. Specifically, we first utilize FlexiCubes with dilation and smoothness regularization as the iso-surface representation to reconstruct an initial geometry efficiently from the multi-view object silhouette. Meanwhile, we employ the environment light radiance field to recover the environment of the scene. Then we devise a recursive differentiable ray tracer to further optimize the geometry, index of refraction and absorption rate simultaneously in a unified and end-to-end manner, leading to high-quality reconstruction of transparent objects in intricate scenes. A prominent advantage of the designed ray tracer is that it can be implemented in CUDA, enabling a significantly reduced computational cost. Extensive experiments on multiple benchmarks demonstrate the superior reconstruction performance of our DiffTrans compared with other methods, especially in intricate scenes involving transparent objects with diverse topology and complex texture. The code is available at this https URL.
285. 【2603.00412】PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
Link: https://arxiv.org/abs/2603.00412
Authors: Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-Language Models, autonomous driving, crucial for applications, applications in robotics, augmented reality
Comments: CVPR 2026 Accepted
Click to view abstract
Abstract:The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at this https URL.
286. 【2603.00409】SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
Link: https://arxiv.org/abs/2603.00409
Authors: Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Multimodal Large, Large Language, Large Language Models, large language model
Comments:
Click to view abstract
Abstract:While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and a deficiency in fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
287. 【2603.00382】DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography
Link: https://arxiv.org/abs/2603.00382
Authors: Yujia Wu, Shuoqi Chen, Shiru Wang, Yucheng Tang, Petr Bruza, Geoffrey P. Luke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling quantitative velocity, ultrasound computed tomography, quantitative velocity mapping, reveals subtle anatomical, subtle anatomical details
Comments:
Click to view abstract
Abstract:Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
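The abstract mentions estimating pixel-wise uncertainty from the stochastic nature of the sampler but does not say how. One common way to do this, assumed here purely for illustration, is to draw several reconstructions for the same input and take the per-pixel mean and standard deviation. The sample values and the function name below are hypothetical.

```python
from statistics import mean, pstdev

def pixelwise_mean_and_std(samples):
    """samples: equally sized 2D maps (nested lists) from repeated stochastic runs."""
    rows, cols = len(samples[0]), len(samples[0][0])
    mean_map = [[mean([s[r][c] for s in samples]) for c in range(cols)]
                for r in range(rows)]
    std_map = [[pstdev([s[r][c] for s in samples]) for c in range(cols)]
               for r in range(rows)]
    return mean_map, std_map

# Three hypothetical 1x2 speed-of-sound maps (m/s) for the same waveform input
samples = [[[1500.0, 1540.0]], [[1500.0, 1560.0]], [[1500.0, 1550.0]]]
mean_map, std_map = pixelwise_mean_and_std(samples)
```

Pixels where the runs agree get a standard deviation of zero, while pixels where they disagree get a larger value that can be displayed as an uncertainty map.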
288. 【2603.00372】Unsupervised Semantic Segmentation in Synchrotron Computed Tomography with Self-Correcting Pseudo Labels
Link: https://arxiv.org/abs/2603.00372
Authors: Austin Yunker, Peter Kenesei, Hemant Sharma, Jun-Sang Park, Antonino Miceli, Rajkumar Kettimuthu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: X-ray computed tomography, monochromatic X-rays, X-ray computed, enabling improved data, improved data quality
Comments:
Click to view abstract
Abstract:X-ray computed tomography (CT) is a widely used imaging technique that provides detailed examinations of the internal structure of an object, with synchrotron CT (SR-CT) enabling improved data quality by using higher-energy, monochromatic X-rays. While SR-CT allows for improved resolution, time-resolved experimentation, and reduced imaging artifacts, it also produces significantly larger datasets than conventional CT. Accurate and efficient evaluation of these datasets is a critical component of these workflows, yet it is often done manually, representing a major bottleneck in the analysis phase. While deep learning has emerged as a powerful tool capable of providing a wide range of purely data-driven solutions, it requires a substantial amount of labeled data for training, and manual annotation of SR-CT datasets is impractical. In this paper, we introduce a novel framework that enables automatic segmentation of large, high-resolution SR-CT datasets by eliminating the need to hand-label images for deep learning training. First, we generate pseudo labels by clustering on the voxel values, identifying regions in the volume with similar attenuation coefficients and producing an initial semantic map. Afterwards, we train a segmentation model on the pseudo labels before utilizing the Unbiased Teacher approach to self-correct them, ensuring accurate final segmentations. We find our approach improves pixel-wise accuracy and mIoU by 13.31% and 15.94%, respectively, over the baseline pseudo labels when using a magnesium crystal SR-CT sample. Additionally, we extensively evaluate the different components of our workflow, including segmentation model, loss function, pseudo labeling strategy, and input type. Finally, we evaluate our approach on two additional samples, highlighting our framework's ability to produce segmentations that are considerably better than the original pseudo labels.
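The pseudo-labeling step described above clusters voxel values into regions of similar attenuation. The abstract does not name the clustering algorithm, so the sketch below assumes plain k-means on scalar intensities as one possible choice. The voxel values, the number of clusters, and the function name are invented for the example.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar intensities with plain k-means; returns (labels, centers)."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly over the intensity range (deterministic)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each voxel takes the label of its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels, centers

# Hypothetical normalized voxel intensities from three material phases
voxels = [0.10, 0.12, 0.11, 0.80, 0.82, 0.79, 0.45, 0.47]
labels, centers = kmeans_1d(voxels, k=3)
```

The resulting labels would serve only as an initial semantic map; in the workflow described in the abstract, a segmentation model is then trained on them and they are self-corrected afterwards.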
289. 【2603.00368】Deep Learning-Based Meat Freshness Detection with Segmentation and OOD-Aware Classification
Link: https://arxiv.org/abs/2603.00368
Authors: Hutama Arif Bramantyo, Mukarram Ali Faridi, Rui Chen, Clarissa Harris, Yin Sun
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: unpackaged meat datasets, freshness classification framework, meat freshness classification, images that supports, supports both packaged
Comments:
Click to view abstract
Abstract:In this study, we present a meat freshness classification framework from Red-Green-Blue (RGB) images that supports both packaged and unpackaged meat datasets. The system classifies four in-distribution (ID) meat classes and uses an out-of-distribution (OOD)-aware abstention mechanism that flags low-confidence samples as No Result. The pipeline combines U-Net-based segmentation with deep feature classifiers. Segmentation is used as a preprocessing step to isolate the meat region and reduce background, producing more consistent inputs for classification. The segmentation module achieved an Intersection over Union (IoU) of 75% and a Dice coefficient of 82%, producing standardized inputs for the classification stage. For classification, we benchmark five backbones: Residual Network-50 (ResNet-50), Vision Transformer-Base/16 (ViT-B/16), Swin Transformer-Tiny (Swin-T), EfficientNet-B0, and MobileNetV3-Small. We use nested 5x3 cross-validation (CV) for model selection and hyperparameter tuning. On the held-out ID test set, EfficientNet-B0 achieves the highest accuracy (98.10%), followed by ResNet-50 and MobileNetV3-Small (both 97.63%) and Swin-T (97.51%), while ViT-B/16 is lower (94.42%). We additionally evaluate OOD scoring and thresholding using standard OOD metrics and sensitivity analysis over the abstention threshold. Finally, we report on-device latency using TensorFlow Lite (TFLite) on a smartphone, highlighting practical accuracy-latency trade-offs for future deployment.
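The abstention mechanism described above can be sketched as a confidence threshold on the classifier output. This is only an illustration: the abstract does not specify the OOD score or the threshold, so the code assumes the maximum softmax probability and a threshold of 0.7, and the class names are hypothetical placeholders.

```python
import math

# Hypothetical class names; the abstract only states that there are four ID classes.
CLASSES = ["class_0", "class_1", "class_2", "class_3"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_abstention(logits, threshold=0.7):
    """Return the predicted class, or "No Result" when confidence is low.

    Assumption: confidence is the maximum softmax probability. The paper
    may use a different OOD score.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "No Result"
    return CLASSES[best]
```

Raising the threshold trades coverage for reliability, which is the kind of trade-off a sensitivity analysis over the abstention threshold would examine.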
290. 【2603.00362】Percept-Aware Surgical Planning for Visual Cortical Prostheses with Vascular Avoidance
Link: https://arxiv.org/abs/2603.00362
Authors: Galen Pogoncheff, Alvin Wang, Jacob Granley, Michael Beyeler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: electrically stimulating neurons, early visual cortex, aim to restore, restore sight, sight by electrically
Comments:
Abstract:Cortical visual prostheses aim to restore sight by electrically stimulating neurons in early visual cortex (V1). With the emergence of high-density and flexible neural interfaces, electrode placement within three-dimensional cortex has become a critical surgical planning problem. Existing strategies emphasize visual field coverage and anatomical heuristics but do not directly optimize predicted perceptual outcomes under safety constraints. We present a percept-aware framework for surgical planning of cortical visual prostheses that formulates electrode placement as a constrained optimization problem in anatomical space. Electrode coordinates are treated as learnable parameters and optimized end-to-end using a differentiable forward model of prosthetic vision. The objective minimizes task-level perceptual error while incorporating vascular avoidance and gray matter feasibility constraints. Evaluated on simulated reading and natural image tasks using realistic folded cortical geometry (FreeSurfer fsaverage), percept-aware optimization consistently improves reconstruction fidelity relative to coverage-based placement strategies. Importantly, vascular safety constraints eliminate margin violations while preserving perceptual performance. The framework further enables co-optimization of multi-electrode thread configurations under fixed insertion budgets. These results demonstrate how differentiable percept models can inform anatomically grounded, safety-aware computer-assisted planning for cortical neural interfaces and provide a foundation for optimizing next-generation visual prostheses.
291. 【2603.00337】Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors
Link: https://arxiv.org/abs/2603.00337
Authors: Xuanshuo Fu, Lei Kang, Javier Vazquez-Corral
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: degrading visual quality, downstream vision tasks, impairing downstream vision, Control Embedding Module, low contrast
Comments:
Abstract:Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks.
292. 【2603.00324】Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Link: https://arxiv.org/abs/2603.00324
Authors: Arya Fayyazi, Haleh Akrami
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: casts multimodal reasoning, explicit reliability guarantees, tool-using framework, framework that casts, casts multimodal
Comments:
Abstract:We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget, expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy-compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.
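To make the notion of a node that "outputs a conformal set" concrete, here is a minimal sketch of standard split conformal prediction for a single classification node. The abstract does not say which conformal procedure or nonconformity score PoP uses; the score `1 - p(label)` and both function names are assumptions for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores[i] is assumed to be 1 - p(true label) on held-out
    calibration example i.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1 - p <= qhat]
```

A controller of the kind the abstract describes could, for example, treat a large prediction set as a signal to spend more compute on that node, though the abstract does not state the exact rule.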
293. 【2603.00289】Seeking Necessary and Sufficient Information from Multimodal Medical Data
Link: https://arxiv.org/abs/2603.00289
Authors: Boyu Chen, Weiye Bao, Junjie Liu, Michael Shen, Bo Peng, Paul Taylor, Zhu Li, Mengyue Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide richer information, data sources, sources can provide, provide richer, richer information
Comments: 11 pages, 1 figure. Submitted to MICCAI 2026
Abstract:Learning multimodal representations from medical images and other data sources can provide richer information for decision-making. While various multimodal models have been developed for this, they overlook learning features that are both necessary (must be present for the outcome to occur) and sufficient (enough to determine the outcome). We argue learning such features is crucial as they can improve model performance by capturing essential predictive information, and enhance model robustness to missing modalities as each modality can provide adequate predictive signals. Such features can be learned by leveraging the Probability of Necessity and Sufficiency (PNS) as a learning objective, an approach that has proven effective in unimodal settings. However, extending PNS to multimodal scenarios remains underexplored and is non-trivial as key conditions of PNS estimation are violated. We address this by decomposing multimodal representations into modality-invariant and modality-specific components, then deriving tractable PNS objectives for each. Experiments on synthetic and real-world medical datasets demonstrate our method's effectiveness. Code will be available on GitHub.
294. 【2603.00273】Ozone Cues Mitigate Reflected Downwelling Radiance in LWIR Absorption-Based Ranging
Link: https://arxiv.org/abs/2603.00273
Authors: Unay Dorken Gallastegi, Wentao Shangguan, Vaibhav Choudhary, Akshay Agarwal, Hoover Rueda-Chacón, Martin J. Stevens, Vivek K Goyal
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: Passive long-wave infrared, emitted thermal radiation, Passive long-wave, absorption-based ranging relies, long-wave infrared
Comments: 15 pages, 10 figures
Abstract:Passive long-wave infrared (LWIR) absorption-based ranging relies on atmospheric absorption to estimate distances to objects from their emitted thermal radiation. First demonstrated decades ago for objects much hotter than the air and recently extended to scenes with low temperature variations, this ranging has depended on reflected radiance being negligible. Downwelling radiance is especially problematic, sometimes causing large inaccuracies. In two new ranging methods, we use characteristic features from ozone absorption to estimate the contribution of reflected downwelling radiance. The quadspectral method gives a simple closed-form range estimate from four narrowband measurements, two at a water vapor absorption line and two at an ozone absorption line. The hyperspectral method uses a broader spectral range to improve accuracy while also providing estimates of temperature, emissivity profiles, and contributions of downwelling from a collection of zenith angles. Experimental results demonstrate improved ranging accuracy, in one case reducing error from over 100 m when reflected light is not modeled to 6.8 m with the quadspectral method and 1.2 m with the hyperspectral method.
295. 【2603.00266】Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization
Link: https://arxiv.org/abs/2603.00266
Authors: He Li, Wenyue He, Weihang Kong, Xingchen Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remain largely underexplored, prediction remain largely, Multimodal adversarial attacks, Multimodal adversarial, largely underexplored
Comments: 12 pages, 8 figures
Abstract:Multimodal adversarial attacks for dense prediction remain largely underexplored. In particular, visual-infrared (VI) perception systems introduce unique challenges due to heterogeneous spectral characteristics and modality-specific intensity distributions. Existing adversarial patch methods are primarily designed for single-modal inputs and fail to account for cross-spectral inconsistencies, leading to reduced attack effectiveness and poor stealthiness when applied to VI dense prediction models. To address these challenges, we propose a joint position-color optimization framework (AP-PCO) for generating adversarial patches in visual-infrared settings. The proposed method optimizes patch placement and color composition simultaneously using a fitness function derived from model outputs, enabling a single patch to perturb both visible and infrared modalities. To further bridge spectral discrepancies, we introduce a cross-modal color adaptation strategy that constrains patch appearance according to infrared grayscale characteristics while maintaining strong perturbations in the visible domain, thereby reducing cross-spectral saliency. The optimization procedure operates without requiring internal model information, supporting flexible black-box attacks. Extensive experiments on visual-infrared dense prediction tasks demonstrate that the proposed AP-PCO achieves consistently strong attack performance across multiple architectures, providing a practical benchmark for robustness evaluation in VI perception systems.
296. 【2603.00223】Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
Link: https://arxiv.org/abs/2603.00223
Authors: Giuseppe Sergioli, Carlo Cuccu, Giovanni Pasini, Alessandro Stefano, Giorgio Russo, Andrés Camilo Granda Arango, Roberto Giuntini
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Keywords: Pretty Good Measurement, operator-valued decision rule, decision rule derived, Pretty Good, Good Measurement
Comments: 15 pages, 7 figures, 1 table, in preparation for journal submission
Abstract:We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity-specificity trade-offs across feature-selection scenarios.
297. 【2603.00217】Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection
Link: https://arxiv.org/abs/2603.00217
Authors: Brianna D'Urso, Tahmid Hasan Sakib, Syed Rafay Hasan, Terry N. Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: Naturalistic Adversarial Patches, traffic sign setting, Traffic Sign Recognition, German Traffic Sign, Sign Recognition Benchmark
Comments: Accepted to the 2nd IEEE Conference on Secure and Trustworthy CyberInfrastructure for IoT and Microelectronics (SaTC 2026), Houston, Texas, USA, March 24 to 26, 2026
Abstract:This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB, customized for the AV environment, by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed using the front CSI camera provided on the QCar. Across configurations, which vary distance, patch size, and patch placement, NAPs reduce the detector's STOP class confidence. These results, along with a detailed step-by-step methodology, indicate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The research further motivates research into defenses that address localized patch corruption in embedded perception pipelines.
298. 【2603.00207】VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
Link: https://arxiv.org/abs/2603.00207
Authors: Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: shown strong performance, complex reasoning tasks, reasoning, shown strong, large reasoning models
Comments: Accepted to CVPR 2026
Abstract:Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
299. 【2603.00206】TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models
Link: https://arxiv.org/abs/2603.00206
Authors: Daniel Nobrega Medeiros
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: natural language prompts, Existing visual reasoning, subjective scoring procedures, narrow reasoning modalities, evaluate narrow reasoning
Comments: 10 pages, 4 figures, 5 tables
Abstract:Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: https://doi.org/10.57967/hf/7904).
300. 【2603.00201】AdURA-Net: Adaptive Uncertainty and Region-Aware Network
Link: https://arxiv.org/abs/2603.00201
Authors: Antik Aich Roy, Ujjwal Bhattacharya
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: genuine diagnostic uncertainty, reflect genuine diagnostic, automated label extraction, diagnostic uncertainty, radiology reports
Comments:
Abstract:A common issue in clinical decision-making is uncertainty, which often arises from ambiguity in radiology reports that reflects genuine diagnostic uncertainty or limitations of automated label extraction in complex cases. This is especially true of multilabel datasets such as CheXpert and MIMIC-CXR, which contain positive, negative, and uncertain labels. In clinical decision-making, the uncertain label plays a tricky role, as the model should not be forced to provide a confident prediction in the absence of sufficient evidence. The ability of the model to say it does not know whenever it is not confident is crucial, especially in high-risk clinical decisions. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: a) adaptive dilated convolution and multiscale deformable alignment coupled with a DenseNet backbone, capturing the anatomical complexities of medical images, and b) a Dual Head Loss, which combines masked binary cross-entropy with logits and a Dirichlet evidential learning objective.
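The masked binary cross-entropy term can be illustrated with a short sketch. It assumes that masking means uncertain labels are excluded from the loss and that uncertain labels are encoded as -1; the abstract confirms neither detail, and the Dirichlet evidential term is not shown.

```python
import math

def masked_bce_with_logits(logits, labels, uncertain=-1):
    """Mean binary cross-entropy over labels that are not marked uncertain.

    Assumptions: labels are 1 (positive), 0 (negative), or `uncertain`,
    and uncertain entries contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for z, y in zip(logits, labels):
        if y == uncertain:
            continue  # masked out: no gradient signal from uncertain labels
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count if count else 0.0
```

With this convention the classification head never sees a forced target for an uncertain finding, which leaves room for a separate uncertainty-aware objective to handle those cases.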
301. 【2603.00198】Stateful Token Reduction for Long-Video Hybrid VLMs
Link: https://arxiv.org/abs/2603.00198
Authors: Jindong Jiang, Amala Sanjay Deshmukh, Kateryna Chumachenko, Karan Sapra, Zhiding Yu, Guilin Liu, Andrew Tao, Pavlo Molchanov, Jan Kautz, Wonmin Byeon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: long-video vision-language models, accelerate long-video vision-language, dense Transformers, linear-time state-space blocks, vision-language models
Comments:
Abstract:Token reduction is an effective way to accelerate long-video vision-language models (VLMs), but most existing methods are designed for dense Transformers and do not directly account for hybrid architectures that interleave attention with linear-time state-space blocks (e.g., Mamba). We study query-conditioned token reduction for hybrid video VLMs and analyze reduction behavior through two properties: layerwise sparsity (how many tokens capture query-relevant information) and importance stability (whether token-importance rankings persist across depth). Although token importance is sparse within each layer, the set of important tokens changes across layers, so aggressive early pruning is unreliable. Motivated by this, we propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks (using an implicit-attention proxy for Mamba), enabling all-layer token reduction in hybrids. Under an aggressive compression setting (retaining 25% of visual tokens), our approach delivers substantial prefilling speedups (3.8-4.2x) with near-baseline accuracy at test time, and light finetuning under reduction further improves performance on long-context video benchmarks.
302. 【2603.00197】A Case Study on Concept Induction for Neuron-Level Interpretability in CNN
Link: https://arxiv.org/abs/2603.00197
Authors: Moumita Sen Sarma, Samatha Ereshi Akkamahadevi, Pascal Hitzler
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Deep Neural Networks, remain poorly understood, Deep Neural, Neural Networks, neurons remain poorly
Comments:
Abstract:Deep Neural Networks (DNNs) have advanced applications in domains such as healthcare, autonomous systems, and scene understanding, yet the internal semantics of their hidden neurons remain poorly understood. Prior work introduced a Concept Induction-based framework for hidden neuron analysis and demonstrated its effectiveness on the ADE20K dataset. In this case study, we investigate whether the approach generalizes by applying it to the SUN2012 dataset, a large-scale scene recognition benchmark. Using the same workflow, we assign interpretable semantic labels to neurons and validate them through web-sourced images and statistical testing. Our findings confirm that the method transfers to SUN2012, showing its broader applicability.
303. 【2603.00194】SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models
Link: https://arxiv.org/abs/2603.00194
Authors: Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: raised growing concerns, copyright protection, malicious misuse, raised growing, growing concerns
Comments: 11 pages, 6 figures
Abstract:The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation, transforming watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation and substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
304. 【2603.00191】Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
Link: https://arxiv.org/abs/2603.00191
Authors: Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Continual Learning, requires models, models to sequentially, sequentially adapt, Continual
Comments: preprint
Abstract:Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for a new task when tasks are correlated. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.
305. 【2603.00188】Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
Link: https://arxiv.org/abs/2603.00188
Authors: Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Vision-Language Models, substantial memory footprint, Large Vision-Language, Vision-Language Models, autonomous GUI agents
Comments:
Abstract:Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
306. 【2603.00184】Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1
Link: https://arxiv.org/abs/2603.00184
Authors: Abhinav Munagala
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: complex plumage patterns, extreme pose diversity, variable lighting conditions, computer vision due, Bird image segmentation
Comments:
Abstract:Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt "bird" before prompting SAM 2.1 with bounding boxes, requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953, outperforming all prior baselines, including SegFormer-B2 (IoU 0.842), by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task-specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
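For readers unfamiliar with the reported metrics, the following sketch computes IoU and Dice for one pair of binary masks. It is a generic illustration, not the paper's evaluation code; masks are assumed to be flat sequences of 0/1 values, and the function name is hypothetical.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as flat sequences of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # convention chosen here: two empty masks match perfectly
    return inter / union, 2 * inter / (pred_area + target_area)
```

For a single pair of masks, Dice = 2·IoU / (1 + IoU). The reported IoU of 0.912 and Dice of 0.954 are consistent with that identity (2 × 0.912 / 1.912 ≈ 0.954), although dataset-averaged values need not satisfy it exactly.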
307. 【2603.00175】Infinite Self-Attention
Link: https://arxiv.org/abs/2603.00175
Authors: Giorgio Roffo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limits Transformer scalability, attention limits Transformer, softmax attention limits, scalability in high-resolution, introduce Infinite Self-Attention
Comments:
Abstract:The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
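The discounted Neumann series mentioned above can be checked numerically on a toy example. The sketch below accumulates sum_k gamma^k A^k for a small row-stochastic attention matrix A; this sum converges to (I - gamma A)^(-1), the fundamental-matrix form the abstract refers to. The discount gamma = 0.5, the truncation depth, and the column-sum aggregation used for token centrality are assumptions for illustration, not details taken from the paper.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def discounted_neumann(attn, gamma=0.5, hops=60):
    """Truncated series sum_{k=0..hops} gamma^k * attn^k."""
    n = len(attn)
    total = identity(n)
    term = identity(n)
    for _ in range(hops):
        # term becomes gamma^k * attn^k for the next hop count k.
        term = [[gamma * x for x in row] for row in matmul(term, attn)]
        total = [[t + x for t, x in zip(trow, xrow)]
                 for trow, xrow in zip(total, term)]
    return total

def token_centrality(kernel):
    """Column sums: discounted visits to each token over all start tokens.

    Assumption: this aggregation is one Katz-style choice among several.
    """
    n = len(kernel)
    return [sum(kernel[i][j] for i in range(n)) for j in range(n)]
```

This quadratic-cost version only illustrates the multi-hop kernel; the Linear-InfSA variant described in the abstract avoids forming the full attention matrix and is not reproduced here.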
308. 【2603.00173】Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model
Link: https://arxiv.org/abs/2603.00173
Authors: Simo Ryu, Chunghwan Han
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: video foundation model, foundation model developed, experience training, describe our experience, video foundation
Comments: 28 pages, 16 figures, 5 tables
Abstract:We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $\mu$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
309. 【2603.00171】AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning
Link: https://arxiv.org/abs/2603.00171
Authors: Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Chengjun Xie, Xuanhua He, Jie Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Multimodal Large Language, Large Language Models, exploring image details, actively exploring image, Multimodal Large
Comments:
Abstract:Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which adds overhead and noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose AdaFocus, a novel training-free framework designed for adaptive visual reasoning. AdaFocus follows a two-stage pipeline: a confidence-based module decides when to crop, and a semantic-guided localization module determines where to crop. This enables adaptive visual reasoning without additional training. Experimentally, AdaFocus delivers substantial performance gains while achieving an approximately $4.0\times$ inference speedup over the SOTA method ZoomEyes, representing a significant advance in both accuracy and efficiency.
310. 【2603.00170】A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition
Link: https://arxiv.org/abs/2603.00170
Authors: Práxedes Martínez-Moreno, Andrea Valsecchi, Pablo Mesejo, Pilar Navarro-Ramírez, Valentino Lugli, Sergio Damas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Keywords: Craniofacial Superimposition, identifying skeletal remains, ante-mortem facial photographs, technique for identifying, identifying skeletal
Comments: 11 pages, 6 figures, 3 tables
Abstract:Craniofacial Superimposition is a forensic technique for identifying skeletal remains by comparing a post-mortem skull with ante-mortem facial photographs. A critical step in this process is Skull-Face Overlay (SFO). This stage involves aligning a 3D skull model with a 2D facial image, typically guided by cranial and facial landmarks' correspondence. However, its accuracy is undermined by individual variability in soft-tissue thickness, introducing significant uncertainty into the overlay. This paper introduces Lilium, an automated evolutionary method to enhance the accuracy and robustness of SFO. Lilium explicitly models soft-tissue variability using a 3D cone-based representation whose parameters are optimized via a Differential Evolution algorithm. The method enforces anatomical, morphological, and photographic plausibility through a combination of constraints: landmark matching, camera parameter consistency, head pose alignment, skull containment within facial boundaries, and region parallelism. This emulation of the usual forensic practitioners' approach leads Lilium to outperform the state-of-the-art method in terms of both accuracy and robustness.
311. 【2603.00168】Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks
Link: https://arxiv.org/abs/2603.00168
Authors: Irfan Atabas, Hatice Karatas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: automatically classify local, cultivated in Turkiye, classify local olive, local olive species, olive species cultivated
Comments:
Abstract:In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Turkiye. A stereo camera was utilized to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification. These models were optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model exhibited the optimal performance, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.
312. 【2603.00166】Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Link: https://arxiv.org/abs/2603.00166
Authors: Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: produce high-quality content, demonstrated remarkable ability, Recent advances, high-quality content, demonstrated remarkable
Comments:
Abstract:Recent advances in generative AI have demonstrated a remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.
313. 【2603.00165】ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering
Link: https://arxiv.org/abs/2603.00165
Authors: Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Images improves fine-grained, Thinking with Images, Images improves, improves fine-grained VQA, fine-grained VQA
Comments:
Abstract:Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on grounding capacity, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, which lead to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these findings, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released upon acceptance.
314. 【2603.00163】A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance
Link: https://arxiv.org/abs/2603.00163
Authors: Nicholas Korcynski
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: extreme class imbalance, thin-stroke subset averages, whiteboard strokes, stroke pixels, class imbalance
Comments: 10 pages, 8 figures. Preprint
Abstract:The binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and the thin-stroke subset averages only $1.14\% \pm 0.41\%$ foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the vast majority of the background dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses (cross-entropy, focal, Dice, Dice+focal, and Tversky) are each trained three times on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$). In addition, the boundary metrics confirm that the gain extends to the precision of the contour. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but with substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1 while the learned model delivers higher worst-case reliability. Doubling training resolution further increases F1 by 12.7 points.
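For readers unfamiliar with the overlap-based losses named in this abstract, the sketch below shows how the Tversky index relates to Dice (F1) on binary masks. It is an illustration under stated assumptions, not the paper's training code: the toy masks, the helper names, and the `alpha`/`beta` values are hypothetical.

```python
import numpy as np

def confusion_counts(pred, target):
    """Return (TP, FP, FN) for two binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))
    return tp, fp, fn

def tversky_index(pred, target, alpha=0.5, beta=0.5):
    """TP / (TP + alpha*FP + beta*FN); alpha = beta = 0.5 recovers Dice."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom  # both masks empty: perfect match

def dice_score(pred, target):
    """Dice coefficient, identical to F1 on binary masks."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom

# Toy 10x10 mask with a 2-pixel stroke (2% foreground, an assumed example).
target = np.zeros((10, 10), dtype=int)
target[5, 3:5] = 1
pred = np.zeros((10, 10), dtype=int)
pred[5, 3] = 1   # one stroke pixel found
pred[4, 3] = 1   # one false positive

dice = dice_score(pred, target)
tversky_sym = tversky_index(pred, target, alpha=0.5, beta=0.5)
tversky_loss = 1.0 - tversky_index(pred, target, alpha=0.3, beta=0.7)
```

Choosing `beta > alpha` penalizes missed stroke pixels more than false positives, which is the usual motivation for Tversky-style losses on sparse foregrounds.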
315. 【2603.00161】SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision
Link: https://arxiv.org/abs/2603.00161
Authors: S. Kalaycioglu, C. Hong, M. Zhu, H. Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Early ophthalmic screening, Eye Aspect Ratio, FaceMesh Eye Aspect, trained practitioners, low-resource and remote
Comments: 25 pages, 7 figures, 5 tables
Abstract:Early ophthalmic screening in low-resource and remote settings is constrained by access to specialized equipment and trained practitioners. We present SKINOPATHY AI, a smartphone-first web application that delivers five complementary, explainable screening modules entirely through commodity mobile hardware: (1) redness quantification via LAB a* color-space normalization; (2) blink-rate estimation using MediaPipe FaceMesh Eye Aspect Ratio (EAR) with adaptive thresholding; (3) pupil light reflex characterization through Pupil-to-Iris Ratio (PIR) time-series analysis; (4) scleral color indexing for icterus and anemia proxies via LAB/HSV statistics; and (5) iris-landmark-calibrated lesion encroachment measurement with millimeter-scale estimates and longitudinal trend tracking. The system is implemented as a React/FastAPI stack with OpenCV and MediaPipe, MongoDB-backed session persistence, and PDF report generation. All algorithms are fully deterministic, privacy-preserving, and designed for non-diagnostic consumer triage. We detail system architecture, algorithm design, evaluation methodology, clinical context, and ethical boundaries of the platform. SKINOPATHY AI demonstrates that multi-signal ophthalmic screening is feasible on unmodified smartphones without cloud-based AI inference, providing a foundation for future clinically validated mobile ophthalmoscopy tools.
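The Eye Aspect Ratio used for blink-rate estimation in module (2) has a simple closed form over six eye landmarks. The sketch below is illustrative only: it takes plain 2D points rather than MediaPipe output, and it swaps the adaptive thresholding described in the abstract for a fixed threshold. The landmark ordering, the threshold of 0.2, and the `min_frames` parameter are assumptions.

```python
import math

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|).

    p1 and p4 are the horizontal eye corners; p2, p3 lie on the upper lid
    and p6, p5 on the lower lid (assumed ordering).
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)

def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least min_frames consecutive frames below threshold.

    A blink is counted only once the eye reopens, so a run that is still
    in progress at the end of the series is ignored.
    """
    blinks = 0
    run = 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    return blinks

# Hypothetical open-eye landmarks in pixel coordinates.
open_ear = eye_aspect_ratio((0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1))

# Hypothetical per-frame EAR values: two blinks and one single-frame dip.
series = [0.3, 0.3, 0.1, 0.1, 0.3, 0.1, 0.3, 0.1, 0.1, 0.1, 0.3]
blinks = count_blinks(series)
```

Requiring a minimum run length filters out single-frame dips caused by landmark jitter.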
316. 【2603.00160】DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops
Link: https://arxiv.org/abs/2603.00160
Authors: Boyang Deng, Yuzhen Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Developing robust models, precision vegetable weeding, annotated weed-crop datasets, Developing robust, scarcity of large-scale
Comments: 10 pages, 2 figures
Abstract:Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.
317. 【2603.00159】FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
Link: https://arxiv.org/abs/2603.00159
Authors: Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Keywords: imperfect lip synchronization, Generating realistic talking-head, remains challenging due, Generating realistic, videos remains challenging
Comments:
Abstract:Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
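The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other samples drawn for the same input. The sketch below shows only that normalization step, assuming scalar rewards are already available. The reward values and the function name are hypothetical, and the paper's composite reward is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical composite rewards for four videos generated from the same audio clip.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive positive advantages and are reinforced; those below the mean are discouraged, without requiring a separate value network.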
318. 【2603.00157】FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility
Link: https://arxiv.org/abs/2603.00157
Authors: Bryceton Bible, Shah Md Nehal Hasnaeen, Hairong Qi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rapidly changing atmospheric, changing atmospheric conditions, Mount Fuji, visitor experience, natural landmarks
Comments: 9 pages (including references), 8 figures, 2 tables. Accepted to the IEEE/CVF WACV 2026 proceedings. Introduces a large human-labeled Mount Fuji visibility dataset; public release forthcoming
Abstract:Visibility of natural landmarks such as Mount Fuji is a defining factor in both tourism planning and visitor experience, yet it remains difficult to predict due to rapidly changing atmospheric conditions. We present FujiView, a multimodal learning framework and dataset for predicting scenic visibility by fusing webcam imagery with structured meteorological data. Our late-fusion approach combines image-derived class probabilities with numerical weather features to classify visibility into five categories. The dataset currently comprises over 100,000 webcam images paired with concurrent and forecasted weather conditions from more than 40 cameras around Mount Fuji, and continues to expand; it will be released to support further research in environmental forecasting. Experiments show that YOLO-based vision features dominate short-term horizons such as "nowcasting" and "samedaycasting", while weather-driven forecasts increasingly take over as the primary predictive signal beyond $+1$d. Late fusion consistently yields the highest overall accuracy, achieving an accuracy of approximately 0.89 for same-day prediction and up to 0.84 for next-day forecasts. These results position Scenic Visibility Forecasting (SVF) as a new benchmark task for multimodal learning.
319. 【2603.00156】BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation
Link: https://arxiv.org/abs/2603.00156
Authors: Saivan Talaei, Fatemeh Daneshfar, Abdulhady Abas Abdullah, Mustaqeem Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: treatment planning, Medical image segmentation, cornerstone of computer-assisted, computer-assisted diagnosis, diagnosis and treatment
Comments:
Abstract:Medical image segmentation is a cornerstone of computer-assisted diagnosis and treatment planning. While recent multimodal vision-language models have shown promise in enhancing semantic understanding through textual descriptions, their resilience in "in-the-wild" clinical settings, characterized by scarce annotations and hardware-induced image degradations, remains under-explored. We introduce BiCLIP (Bidirectional and Consistent Language-Image Processing), a framework engineered to bolster robustness in medical segmentation. BiCLIP features a bidirectional multimodal fusion mechanism that enables visual features to iteratively refine textual representations, ensuring superior semantic alignment. To further stabilize learning, we implement an augmentation consistency objective that regularizes intermediate representations against perturbed input views. Evaluation on the QaTa-COV19 and MosMedData+ benchmarks demonstrates that BiCLIP consistently surpasses state-of-the-art image-only and multimodal baselines. Notably, BiCLIP maintains high performance when trained on as little as 1% of labeled data and exhibits significant resistance to clinical artifacts, including motion blur and low-dose CT noise.
320. 【2603.00155】EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
Link: https://arxiv.org/abs/2603.00155
Authors: Wenxin Tang, Jingyu Xiao, Yanpei Gong, Fengyuan Ran, Tongchuan Xia, Junliang Liu, Man Ho Lam, Wenxuan Wang, Michael R. Lyu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: visually coherent presentations, distill lengthy research, lengthy research papers, Multimodal Large Language, Large Language Models
Comments:
Abstract:Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing approaches based on Multimodal Large Language Models (MLLMs), however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at this https URL.
321. 【2603.00152】Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Link: https://arxiv.org/abs/2603.00152
Authors: Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Relative Policy Optimization, Group Relative Policy, Visual Large Language, Policy Optimization, Group Relative
Comments:
Abstract:Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr. Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at this https URL.
322. 【2603.00151】Multiview Progress Prediction of Robot Activities
Link: https://arxiv.org/abs/2603.00151
Authors: Elena Zoppellari, Federico Becattini, Marco Fiorucci, Lamberto Ballan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: safely alongside humans, alongside humans, operate effectively, effectively and safely, safely alongside
Comments: Accepted at ICASSP 2026
Abstract:For robots to operate effectively and safely alongside humans, they must be able to understand the progress of ongoing actions. This ability, known as action progress prediction, is critical for tasks ranging from timely assistance to autonomous decision-making. However, modeling action progression in robotics has often been overlooked. Moreover, a single camera may be insufficient for understanding a robot's ego-actions, as self-occlusion can significantly hinder perception and model performance. In this paper, we propose a multi-view architecture for action progress prediction in robot manipulation tasks. Experiments on Mobile ALOHA demonstrate the effectiveness of the proposed approach.
323. 【2603.00150】Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
Link: https://arxiv.org/abs/2603.00150
Authors: Zihang Zou, Boqing Gong, Liqiang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: critical threat posed, highlight a critical, critical threat, threat posed, posed by emerging
Comments: Accepted to ICCV 2025. Code available at: [this https URL](https://github.com/zzzucf/Neural-Plagiarism)
Abstract:In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can replicate copyrighted images, even when protected by advanced watermarking techniques. To expose vulnerabilities in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.
324. 【2603.00149】Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Link: https://arxiv.org/abs/2603.00149
Authors: Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: ignore physical constraints, models transfer poorly, Existing image, generic diffusion models, diffusion models transfer
Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)
Abstract:Existing image super-resolution (SR) and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid SR with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at this https URL.
325. 【2603.00148】Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models
Link: https://arxiv.org/abs/2603.00148
Authors: Binesh Sadanandan, Vahid Behzadan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical Vision-Language Models, Vision-Language Models, Models can give, Sadanandan and Behzadan, medical VQA
Comments:
Abstract:Medical Vision-Language Models can give different yes/no answers to rephrasings of the same clinical question. We study this in MedGemma-4B using PSF-Med (Sadanandan and Behzadan, 2025), which provides paraphrase pairs for systematic consistency evaluation on medical VQA. On MIMIC-CXR binary questions (n = 158), the baseline flip rate is 14.6% and the mean margin difference is 1.63 logits. We validate that Gemma Scope 2 Sparse Autoencoders (SAEs) transfer to MedGemma activations, achieving $R^2 \approx 0.997$ on both medical and general text (n = 100 prompts each, p < 0.001 for exceeding a 0.95 threshold). We then fine-tune Low-Rank Adaptation (LoRA) adapters with a combined loss that balances paraphrase consistency with answer accuracy. This combined approach prevents the mode collapse that occurs with pure consistency training while reducing the flip rate from 14.6% to 4.4% (p = 0.002, two-proportion z-test) and the margin difference from 1.63 to 0.33 (79.5% reduction). Accuracy remains stable at 84.2% baseline versus 82.3% after training (-1.9pp, not significant). On PadChest Balanced (n = 250), the flip rate drops from 13.6% to 7.8%, the mean margin difference drops from 1.08 to 0.35 (67.9% reduction), and accuracy increases from 66.4% to 69.4%. A layer-range ablation shows that early layers reduce margin differences more than mechanistically selected middle layers.
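The significance test reported for the flip-rate reduction can be reproduced approximately with a pooled two-proportion z-test. The sketch below is a worked example under an assumption: the counts 23 and 7 out of 158 are inferred from the reported 14.6% and 4.4% rates and are not stated in the abstract. The function name is hypothetical.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Assumed counts: 23/158 is about 14.6% flips before, 7/158 about 4.4% after.
z, p = two_proportion_z_test(23, 158, 7, 158)
```

With these assumed counts the statistic is a little above 3 and the two-sided p-value rounds to 0.002, in line with the reported figure.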
326. 【2603.00147】Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents
Link: https://arxiv.org/abs/2603.00147
Authors: Carlos Monroy, Benjamin Navarro
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Keywords: established computational techniques, established computational, broader discipline, Image segmentation, image processing
Comments: 6 pages, 7 figures
Abstract:Image segmentation and image recognition are well-established computational techniques in the broader discipline of image processing. Segmentation locates areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, the degree of specialization in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology (ontoShip) and glossary (glosShip) of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of combining these technologies to improve the curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.
327. 【2603.00145】M-Gaussian: An Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction
Link: https://arxiv.org/abs/2603.00145
Authors: Kangyuan Zheng, Xuan Cai, Jiangqi Wang, Guixing Fu, Zhuoshuo Li, Yazhou Chen, Xinting Ge, Liangqiong Qu, Mengting Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Magnetic Resonance Imaging, non-invasive imaging modality, crucial non-invasive imaging, Magnetic Resonance, Resonance Imaging
Comments: 15 pages, 9 figures
Abstract:Magnetic Resonance Imaging (MRI) is a crucial non-invasive imaging modality. In routine clinical practice, multi-stack thick-slice acquisitions are widely used to reduce scan time and motion sensitivity, particularly in challenging scenarios such as fetal brain imaging. However, the resulting severe through-plane anisotropy compromises volumetric analysis and downstream quantitative assessment, necessitating robust reconstruction of isotropic high-resolution volumes. Implicit neural representation methods, while achieving high quality, suffer from computational inefficiency due to complex network structures. We present M-Gaussian, adapting 3D Gaussian Splatting to MRI reconstruction. Our contributions include: (1) Magnetic Gaussian primitives with physics-consistent volumetric rendering, (2) neural residual field for high-frequency detail refinement, and (3) multi-resolution progressive training. Our method achieves an optimal balance between quality and speed. On the FeTA dataset, M-Gaussian achieves 40.31 dB PSNR while being 14 times faster, representing the first successful adaptation of 3D Gaussian Splatting to multi-stack MRI reconstruction.
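The PSNR figure quoted above is a standard reconstruction metric. As a rough illustration of how such a number is computed, here is a minimal sketch assuming intensities on a `[0, max_val]` scale; the function name `psnr` and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
import math

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all samples (voxels flattened into a list).
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy example (assumed data): a uniform error of 0.01 on a [0, 1] intensity scale.
ref = [0.0, 0.5, 1.0, 0.25]
rec = [v + 0.01 for v in ref]
print(round(psnr(ref, rec), 2))
```

On this scale, a PSNR near 40 dB corresponds to a root-mean-square error of about 1% of the intensity range.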
328. 【2603.00144】Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
Link: https://arxiv.org/abs/2603.00144
Authors: Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: requires coherent modeling, Generating realistic, requires coherent, coherent modeling, Hierarchical Variational Autoencoder
Comments:
Abstract:Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.
329. 【2603.00143】GrapHist: Graph Self-Supervised Learning for Histopathology
Link: https://arxiv.org/abs/2603.00143
Authors: Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian, Pascal Frossard, Dorina Thanou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: achieved notable success, achieved notable, notable success, Self-supervised vision models, Self-supervised vision
Comments:
Abstract:Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that a biologically-informed modeling of tissues as cell graphs offers a more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at this https URL , establishing the first large-scale graph benchmark in this field. Our code is available at this https URL .
330. 【2603.00141】From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Link: https://arxiv.org/abs/2603.00141
Authors: Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: extending inference time, inference time, paradigm that improves, extending inference, improves image generation
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
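To make the contrast with Best-of-N concrete, the following is a deliberately simplified sketch of early stopping under a sampling budget. It is not the ADE-CoT algorithm: the callables `generate` and `verify`, the threshold, and the toy scores are all hypothetical stand-ins, and the paper's difficulty-aware budgets and edit-specific pruning are omitted.

```python
def best_of_n(generate, verify, n):
    """Baseline: always draw n candidates, return the best one and the sample count."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=verify), n

def opportunistic_stop(generate, verify, budget, threshold):
    """Draw candidates one at a time; stop as soon as one clears the threshold."""
    best, best_score = None, float("-inf")
    for i in range(budget):
        candidate = generate(i)
        score = verify(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            return best, i + 1  # early exit: the remaining budget is never spent
    return best, budget

# Toy stand-ins (assumptions): candidate i is the integer i, scored by a fixed lookup.
scores = [0.2, 0.4, 0.9, 0.5, 0.95, 0.1, 0.3, 0.6]
generate = lambda i: i
verify = lambda c: scores[c]

print(best_of_n(generate, verify, n=8))               # (4, 8)
print(opportunistic_stop(generate, verify, 8, 0.85))  # (2, 3)
```

In this toy run the early-stopping variant accepts a slightly lower-scoring candidate after 3 samples instead of 8, which is the kind of trade-off a verifier threshold controls.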
331. 【2603.00140】Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion
Link: https://arxiv.org/abs/2603.00140
Authors: Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: memorize training data, training data, revealing a fundamental, memorize training, fundamental failure
Comments:
Abstract:Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: this https URL.
332. 【2603.00139】Towards Data-driven Nitrogen Estimation in Wheat Fields using Multispectral Images
Link: https://arxiv.org/abs/2603.00139
Authors: Andreas Tritsarolis, Tomaž Bokan, Matej Brumen, Domen Mongus, Yannis Theodoridis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduce environmental impacts, improve resource utilization, modernization of agriculture, agriculture has motivated, motivated the development
Comments:
Abstract:The modernization of agriculture has motivated the development of advanced analytics and decision-support systems to improve resource utilization and reduce environmental impacts. Targeted Spraying and Fertilization (TSF) is a critical operation that enables farmers to apply inputs more precisely, optimizing resource use and promoting environmental sustainability. However, accurate TSF is a challenging problem, due to external factors such as crop type, fertilization phase, soil conditions, and weather dynamics. In this paper, we present TerrAI, a Neural Network-based solution for TSF, which considers the spatio-temporal variability across different parcels. Our experimental study over a real-world remote sensing dataset validates the soundness of TerrAI on data-driven agricultural practices.
333. 【2603.00138】Latent Replay Detection: Memory-Efficient Continual Object Detection on Microcontrollers via Task-Adaptive Compression
Link: https://arxiv.org/abs/2603.00138
Authors: Bibin Wilson
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deploying object detection, Deploying object, enables intelligent edge, enables intelligent, categories after deployment
Comments:
Abstract:Deploying object detection on microcontrollers (MCUs) enables intelligent edge devices, but current models cannot learn new object categories after deployment. Existing continual learning methods require storing raw images far exceeding MCU memory budgets of tens of kilobytes. We present Latent Replay Detection (LRD), the first framework for continual object detection under MCU memory constraints. Our key contributions are: 1. Task-Adaptive Compression: Unlike fixed PCA, we propose learnable compression with FiLM (Feature-wise Linear Modulation) conditioning, where task-specific embeddings modulate the compression to preserve discriminative features for each task's distribution; 2. Spatial-Diverse Exemplar Selection: Traditional sampling ignores spatial information critical for detection, so we select exemplars maximizing bounding box diversity via farthest-point sampling in IoU space, preventing localization bias in replay; 3. MCU-Deployable System: Our latent replay stores 150 bytes per sample versus 10KB for images, enabling a 64KB buffer to hold 400+ exemplars. Experiments on CORe50 (50 classes, 5 tasks) demonstrate that LRD achieves mAP@50 on the initial task and maintains strong performance across subsequent tasks, a significant improvement over naive fine-tuning while operating within strict MCU constraints. Our task-adaptive FiLM compression and spatially diverse exemplar selection work synergistically to preserve detection capabilities. Deployed on STM32H753ZI, ESP32-S3, and MAX78000 MCUs, LRD achieves 4.9-97.5ms latency per inference within a 64KB memory budget, enabling practical continual detection on edge devices for the first time.
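The exemplar-selection idea and the buffer arithmetic in this abstract can be illustrated with a short sketch. It assumes that "distance in IoU space" means `1 - IoU` between bounding boxes and that 64KB means 64 × 1024 bytes; both are assumptions, and the function names are hypothetical rather than the author's implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def farthest_point_sample(boxes, k):
    """Greedily pick up to k boxes, each maximizing its minimum (1 - IoU)
    distance to the boxes already chosen."""
    if not boxes or k <= 0:
        return []
    selected = [0]  # assumption: seed the selection with the first box
    min_dist = [1.0 - iou(boxes[0], b) for b in boxes]
    while len(selected) < min(k, len(boxes)):
        remaining = [i for i in range(len(boxes)) if i not in selected]
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, b in enumerate(boxes):
            min_dist[i] = min(min_dist[i], 1.0 - iou(boxes[nxt], b))
    return selected

# Toy boxes (assumed data): three near-duplicates around the origin and one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 10, 9)]
print(farthest_point_sample(boxes, 2))  # [0, 2]

# Buffer capacity at 150 bytes per latent sample, assuming 64KB = 64 * 1024 bytes.
print((64 * 1024) // 150)  # 436
```

The capacity of 436 samples is consistent with the "400+ exemplars" stated in the abstract; at 10KB per raw image the same buffer would hold only 6.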
334. 【2603.00136】TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
Link: https://arxiv.org/abs/2603.00136
Authors: Bibin Wilson
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: micro controller units, current approaches rely, detection enables recognising, object detection enables, vision language models
Comments:
Abstract:Zero-shot object detection enables recognising novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285KB of RAM and 892KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on STM32H7 and over 1,000 FPS on MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
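The combination of nested (Matryoshka) embeddings and quantized class prototypes can be sketched as follows. This is an illustration only: it assumes the 4x saving comes from storing prototypes as int8 instead of float32, uses symmetric per-vector quantization, and works on toy 8-dimensional vectors rather than the 16-256 dimensions mentioned in the abstract. All names and data are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def truncate(embedding, dim):
    """Matryoshka-style use: keep the first `dim` coordinates and re-normalize."""
    return normalize(embedding[:dim])

def quantize_int8(v):
    """Symmetric per-vector int8 quantization; returns (integer codes, scale)."""
    scale = max(abs(x) for x in v) / 127.0 or 1.0
    return [int(round(x / scale)) for x in v], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def classify(image_emb, prototypes, dim):
    """Pick the class whose quantized, truncated prototype is most similar."""
    query = truncate(image_emb, dim)
    best_name, best_score = None, float("-inf")
    for name, proto in prototypes.items():
        codes, scale = quantize_int8(truncate(proto, dim))
        score = cosine(query, dequantize(codes, scale))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy precomputed class embeddings and one image embedding (assumed data).
prototypes = {
    "cat": [0.9, 0.1, 0.0, 0.2, 0.05, 0.0, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.1, 0.0, 0.0, 0.05, 0.0, 0.1],
}
image = [0.8, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.1]
print(classify(image, prototypes, dim=8), classify(image, prototypes, dim=4))
```

Under the stated assumption, a `dim`-dimensional prototype costs `4 * dim` bytes as float32 and `dim` bytes as int8 (plus one scale value), and truncating `dim` trades accuracy for memory without retraining.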
335. 【2603.00133】You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models
Link: https://arxiv.org/abs/2603.00133
Authors: Kairan Zhao, Eleni Triantafillou, Peter Triantafillou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Generative models, near-verbatim generating images, copyright infringement, near-verbatim generating, privacy concerns
Comments:
Abstract:Generative models have been shown to "memorize" certain training data, leading to the generation of verbatim or near-verbatim images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data, without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
336. 【2603.00132】Predicting Local Climate Zones using Urban Morphometrics and Satellite Imagery
Link: https://arxiv.org/abs/2603.00132
Authors: Hugo Majer, Martin Fleischmann
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: Local Climate Zone, Climate Zone, Local Climate, mapping predominantly relies, LCZ
Comments:
Abstract:The Local Climate Zone (LCZ) framework is commonly employed to represent urban form in morphological analyses even though its mapping relies predominantly on satellite imagery. Urban morphometrics, describing urban form via numerical measures of physical aspects and spatial relationships of its elements, offers another avenue. This study evaluates the ability of morphometric assessment to predict LCZs using a) a morphometric-based LCZ prediction, and b) a fusion-based LCZ prediction combining morphometrics with satellite imagery. We calculate 321 2D morphometric attributes from building footprints and street networks, covering their various properties at multiple spatial scales. Subsequently, we develop four classification schemes: morphometric-based prediction, baseline image-based prediction, and two techniques fusing morphometrics with imagery. We evaluate them across five sites. Results from the morphometric-based prediction indicate that the correspondence between 2D urban morphometrics and urban LCZ types is selective and inconsistent, rendering the efficacy of this method site-dependent. Nevertheless, it demonstrated that a much broader range of urban form properties is relevant for distinguishing LCZ types compared to standard parameters. Relative to the image-based baseline, the fusion yielded relatively distinct accuracy improvements for urban LCZ types at two sites; however, gains at the remaining sites were negligible or even slightly negative, suggesting that the benefits of fusion are modest and inconsistent. Collectively, these results indicate that the relationship between the LCZs and the measurable, visible aspects of urban form is tenuous, thus the LCZ framework should be used with caution in morphological studies.
337. 【2603.00127】Segmenting Low-Contrast XCTs of Concretes: An Unsupervised Approach
Link: https://arxiv.org/abs/2603.00127
Authors: Kaustav Das, Gaston Rauchs, Jan Sykora, Anna Kucerova
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: X-ray computed tomography, convolutional neural network, X-ray computed, neural network, computed tomography
Comments:
Abstract:This work tests a self-annotation-based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X-ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to similar X-ray attenuation coefficients of aggregates and mortar, resulting in low contrast between the two phases in the ensuing images. While CNN-based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or costly to obtain. To counter that limitation, a self-annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN-based model. This enables the model to learn a global-local relationship in the images and enables identification of semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.
338. 【2603.00126】QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference
Link: https://arxiv.org/abs/2603.00126
Authors: Miao Zhang, Ruixiao Zhang, Jianxin Shi, Hengzhi Wang, Hao Fang, Jiangchuan Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM); Performance (cs.PF); Systems and Control (eess.SY)
Keywords: bringing unified solutions, Video-language models, bringing unified, reasoning tasks, unified solutions
Comments:
Abstract:Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.
339. 【2603.00124】OrthoAI: A Lightweight Deep Learning Framework for Automated Biomechanical Analysis in Clear Aligner Orthodontics -- A Methodological Proof-of-Concept
Link: https://arxiv.org/abs/2603.00124
Authors: Edouard Lansiaux, Margaux Leman, Mehdi Ammi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Clear aligner therapy, Align Technology, Clear aligner, planned tooth movements-typically, digitally planned tooth
Comments:
Abstract:Clear aligner therapy now dominates orthodontics, yet clinician review of digitally planned tooth movements, typically via ClinCheck (Align Technology), remains slow and error-prone. We present OrthoAI, an open-source proof-of-concept decision-support system combining lightweight 3D dental segmentation with automated biomechanical analysis to assist treatment-plan evaluation. The framework uses a Dynamic Graph CNN trained on landmark-reconstructed point clouds from 3DTeethLand (MICCAI) and integrates a rule-based biomechanical engine grounded in orthodontic evidence (Kravitz et al. 2009; Simon et al. 2014). The system decomposes per-tooth motion across six degrees of freedom, computes movement-specific predictability, issues alerts when biomechanical limits are exceeded, and derives an exploratory composite index. With 60,705 trainable parameters, segmentation reaches a Tooth Identification Rate of $81.4\%$ and mIoU of $8.25\%$ on surrogate point clouds, reflecting sparse landmark supervision rather than dense meshes. Although spatial boundaries are coarse, downstream analysis depends mainly on tooth identity and approximate centroid/axis estimation. Results establish a baseline for future full-mesh training and highlight current perceptual limits. The end-to-end pipeline runs in $4s$ on consumer hardware. Code, weights, and analysis tools are released to support reproducible research in geometric deep learning and digital orthodontics. The system has not been validated on real intraoral meshes and should not be assumed to generalize beyond landmark-derived representations.
340. 【2603.00123】CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers
Link: https://arxiv.org/abs/2603.00123
Authors: Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: visual question answering, Large Vision-Language Models, shown strong potential, Recent advances, advances in Large
Comments: submitting to ACL 2026
Abstract:Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow where radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.
341. 【2603.00122】NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
Link: https://arxiv.org/abs/2603.00122
Authors: Aman Ulla
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: retrieval-augmented generation, important step, step before retrieval-augmented, downstream generative, Document extraction
Comments: 17 pages, 10 figures, 5 tables
Abstract:Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents like PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models - element detection and layout detection - with rule-based grouping and optional vision-language enhancement. When a page image is sent in, the first thing that happens is that it goes through both models at the same time. The element model finds semantic content like the title, header, text, table, image, and so on, and the layout model finds structural regions like layout_box, column_group, multi_column, row_group, and so on. A key design decision is to first send an image or figure through an image classifier (ViT) that decides whether it is relevant or not. Only useful images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and costs. NovaLAD is built for speed: it works on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several forms, including structured JSON, Markdown, RAG-ready texts, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and get 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains how to extract data, how the architecture works, how data flows, and how to make NovaLAD both accurate and usable without needing a GPU.
342. 【2603.00119】BiSe-Unet: A Lightweight Dual-path U-Net with Attention-refined Context for Real-time Medical Image Segmentation
Link: https://arxiv.org/abs/2603.00119
Authors: M Iffat Hossain, Laura Brattain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: image-guided procedures, Abstract, lightweight, Kvasir-Seg dataset, procedures
Comments: Submitted to IEEE EMBC 2026. This work has been submitted to the IEEE for possible publication
Abstract:During image-guided procedures, real-time image segmentation is often required. This demands lightweight AI models that can operate on resource-constrained devices. One important use case is endoscopy-guided colonoscopy, where polyps must be detected in real time. The Kvasir-Seg dataset, a publicly available benchmark for this task, contains 1,000 high-resolution endoscopic images of polyps with corresponding pixel-level segmentation masks. Achieving real-time inference speed for clinical deployment in constrained environments requires highly efficient and lightweight network architectures. However, many existing models remain too computationally intensive for embedded deployment. Lightweight architectures, although faster, often suffer from reduced spatial precision and weaker contextual understanding, leading to degraded boundary quality and reduced diagnostic reliability. To address these challenges, we introduce BiSe-UNet, a lightweight dual-path U-Net that integrates an attention-refined context path with a shallow spatial path for detailed feature preservation, followed by a depthwise separable decoder for efficient reconstruction. Evaluated on the Kvasir-Seg dataset, BiSe-UNet achieves competitive Dice and IoU scores while sustaining real-time throughput exceeding 30 FPS on Raspberry Pi 5, demonstrating its effectiveness for accurate, lightweight, and deployable medical image segmentation on edge hardware.
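The Dice and IoU scores mentioned above are standard overlap metrics for segmentation masks. A minimal sketch of both, assuming binary masks flattened into 0/1 lists (the function name and toy masks are hypothetical):

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # convention: two empty masks agree perfectly
    return 2.0 * inter / (pred_area + target_area), inter / union

# Toy masks (assumed data): 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, jaccard = dice_and_iou(pred, target)
print(round(dice, 3), jaccard)  # 0.667 0.5
```

The two metrics are monotonically related by Dice = 2·IoU / (1 + IoU), so they always rank a pair of predictions on the same image identically.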
343. 【2603.00118】Efficient Image Super-Resolution with Multi-Scale Spatial Adaptive Attention Networks
Link: https://arxiv.org/abs/2603.00118
Authors: Sushi Rao, Jingwei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Spatial Adaptive Attention, Multi-scale Spatial Adaptive, Adaptive Attention Module, Adaptive Attention Network, high reconstruction fidelity
Comments:
Abstract:This paper introduces a lightweight image super-resolution (SR) network, termed the Multi-scale Spatial Adaptive Attention Network (MSAAN), to address the common dilemma between high reconstruction fidelity and low model complexity in existing SR methods. The core of our approach is a novel Multi-scale Spatial Adaptive Attention Module (MSAA), designed to jointly model fine-grained local details and long-range contextual dependencies. The MSAA comprises two synergistic components: a Global Feature Modulation Module (GFM) that learns coherent texture structures through differential feature extraction, and a Multi-scale Feature Aggregation Module (MFA) that adaptively fuses features from local to global scales using pyramidal processing. To further enhance the network's capability, we propose a Local Enhancement Block (LEB) to strengthen local geometric perception and a Feature Interactive Gated Feed-Forward Module (FIGFF) to improve nonlinear representation while reducing channel redundancy. Extensive experiments on standard benchmarks (Set5, Set14, B100, Urban100, Manga109) across $\times2$, $\times3$, and $\times4$ scaling factors demonstrate that both our lightweight (MSAAN-light) and standard (MSAAN) versions achieve superior or competitive performance in terms of PSNR and SSIM, while maintaining significantly lower parameters and computational costs than state-of-the-art methods. Ablation studies validate the contribution of each component, and visual results show that MSAAN reconstructs sharper edges and more realistic textures.
344. 【2603.00116】VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation
Link: https://arxiv.org/abs/2603.00116
Authors: Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: target internal part, target internal, Non-destructive extraction, internal part, cutting surrounding structures
Comments: 11 pages
Abstract:Non-destructive extraction of the target internal part, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: The high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.
345. 【2603.00114】Automated Quality Check of Sensor Data Annotations
Link: https://arxiv.org/abs/2603.00114
Authors: Niklas Freund,Zekiye Ilknur-Öz,Tobias Klockau,Patrick Naumann,Philipp Neumaier,Martin Köppel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: track environment plays, automation level Grade, plays an important, important role, automation level
Comments:
Click to view abstract
Abstract:The monitoring of the route and track environment plays an important role in automated driving. For example, it can be used as an assistance system for route monitoring in automation level Grade of Automation (GoA) 2, where the train driver is still on board. In fully automated, driverless driving at automation level GoA4, these systems finally take over environment monitoring completely independently. With the help of artificial intelligence (AI), they react automatically to risks and dangerous events on the route. To train such AI algorithms, large amounts of training data are required, which must meet high-quality standards due to their safety relevance. In this publication we present an automatic method for assuring the quality of training data, significantly reducing the manual workload and accelerating the development of these systems. We propose an open-source tool designed to detect nine common errors found in multi-sensor datasets for railway vehicles. To evaluate the performance of the framework, all detected errors were manually validated. Six issue detection methods achieved 100% precision, while three additional methods reached precision rates of 96% and 97%.
346. 【2603.00070】Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems
Link: https://arxiv.org/abs/2603.00070
Authors: Datorien L. Anderson
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: confident incorrect prediction, Standard evaluation metrics, machine learning, errors are equivalent, evaluation metrics
Comments: 18 pages, 1 figure, full experiment data can be found: [this https URL](https://zenodo.org/records/18530003)
Click to view abstract
Abstract:Standard evaluation metrics for machine learning -- accuracy, precision, recall, and AUROC -- assume that all errors are equivalent: a confident incorrect prediction is penalized identically to an uncertain one. For discrete commitment systems (architectures that select committed states {-W, 0, +W}), this assumption is epistemologically flawed. We introduce the Certainty-Validity (CVS) Framework, a diagnostic method that decomposes model performance into a 2x2 matrix distinguishing high/low certainty from valid/invalid predictions. This framework reveals a critical failure mode hidden by standard accuracy: Confident-Incorrect (CI) behavior, where models hallucinate structure in ambiguous data. Through ablation experiments on Fashion-MNIST, EMNIST, and IMDB, we analyze the "83% Ambiguity Ceiling" -- a stopping point where this specific discrete architecture consistently plateaus on noisy benchmarks. Unlike continuous models that can surpass this ceiling by memorizing texture or statistical noise, the discrete model refuses to commit to ambiguous samples. We show that this refusal is not a failure but a feature: the model stops where structural evidence ends. However, standard training on ambiguous data eventually forces Benign Overfitting, causing a pathological migration from Uncertain-Incorrect (appropriate doubt) to Confident-Incorrect (hallucination). We propose that "good training" for reasoning systems must be defined not by accuracy, but by maximizing the Certainty-Validity Score (CVS) -- ensuring the model knows where to stop.
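The 2x2 decomposition described in this abstract can be illustrated with a minimal sketch. The abstract names only the Confident-Incorrect and Uncertain-Incorrect cells and gives no formula for the Certainty-Validity Score, so the other two cell names, the `threshold` value, and the function name below are assumptions for illustration, not the author's implementation.

```python
from collections import Counter

def certainty_validity_matrix(confidences, predictions, labels, threshold=0.8):
    """Count predictions in each cell of a 2x2 certainty/validity table.

    `threshold` is a hypothetical cut-off separating "confident" from
    "uncertain" predictions; the abstract does not specify one.
    Keys: first letter C/U = confident/uncertain,
          second letter C/I = correct/incorrect.
    """
    counts = Counter({"CC": 0, "CI": 0, "UC": 0, "UI": 0})
    for conf, pred, label in zip(confidences, predictions, labels):
        certainty = "C" if conf >= threshold else "U"
        validity = "C" if pred == label else "I"
        counts[certainty + validity] += 1
    return dict(counts)

# Four toy predictions: accuracy is 50%, but only one of the two errors
# is a confident one.
cells = certainty_validity_matrix(
    confidences=[0.95, 0.90, 0.55, 0.60],
    predictions=[1, 0, 1, 0],
    labels=[1, 1, 1, 1],
)
print(cells)
```

Plain accuracy would treat the two wrong predictions identically; the table separates the confident error (`CI`) from the hedged one (`UI`), which is the distinction the abstract argues for.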
347. 【2603.00060】Learning Under Extreme Data Scarcity: Subject-Level Evaluation of Lightweight CNNs for fMRI-Based Prodromal Parkinsons Detection
Link: https://arxiv.org/abs/2603.00060
Authors: Naimur Rahman
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: prodromal Parkinsons, difficult to obtain, reflect these constraints, applied in settings, prodromal Parkinsons disease
Comments: Methodological case study cs.LG on subject-level evaluation and model capacity under extreme data scarcity; 9 pages, 1 figure. Experiments use 40-subject PPMI fMRI cohort; no external validation
Click to view abstract
Abstract:Deep learning is often applied in settings where data are limited, correlated, and difficult to obtain, yet evaluation practices do not always reflect these constraints. Neuroimaging for prodromal Parkinsons disease is one such case, where subject numbers are small and individual scans produce many highly related samples. This work examines prodromal Parkinsons detection from resting-state fMRI as a machine learning problem centered on learning under extreme data scarcity. Using fMRI data from 40 subjects, including 20 prodromal Parkinsons cases and 20 healthy controls, ImageNet-pretrained convolutional neural networks are fine-tuned and evaluated under two different data partitioning strategies. Results show that commonly used image-level splits allow slices from the same subject to appear in both training and test sets, leading to severe information leakage and near-perfect accuracy. When a strict subject-level split is enforced, performance drops substantially, yielding test accuracies between 60 and 81 percent. Models with different capacity profiles are compared, including VGG19, Inception V3, Inception ResNet V2, and the lightweight MobileNet V1. Under subject-level evaluation, MobileNet demonstrates the most reliable generalization, outperforming deeper architectures despite having significantly fewer parameters. These results indicate that in extreme low-data regimes, evaluation strategy and model capacity have a greater impact on performance than architectural depth. Although the analysis is limited to a single cohort of 40 subjects and does not include external validation or cross-validation, it provides a concrete case study and practical recommendations for evaluating deep learning models under severe data scarcity.
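The difference between image-level and subject-level splitting can be sketched in a few lines. This is a generic illustration rather than the paper's pipeline: the subject IDs, the ten slices per subject, the 25% test fraction, and the function name are all hypothetical, and the split is not stratified by diagnosis.

```python
import random

def subject_level_split(subject_ids, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) so that no subject is in both sets.

    `subject_ids` holds one subject label per sample (e.g. per fMRI slice).
    Whole subjects are assigned to the test set, never individual samples.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx

# 40 hypothetical subjects with 10 slices each.
slice_subjects = [f"sub-{k:02d}" for k in range(40) for _ in range(10)]
train_idx, test_idx = subject_level_split(slice_subjects)

# The leakage check: no subject contributes slices to both partitions.
train_subjects = {slice_subjects[i] for i in train_idx}
held_out_subjects = {slice_subjects[i] for i in test_idx}
assert train_subjects.isdisjoint(held_out_subjects)
```

Shuffling the 400 slices directly would almost surely place slices of every subject on both sides of the split, which is the leakage the abstract describes.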
348. 【2502.16612】MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Link: https://arxiv.org/abs/2502.16612
Authors: Mohamed Bayan Kmainasi,Abul Hasnat,Md Arid Hasan,Ali Ezzat Shahroor,Firoj Alam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: social media presents, media presents significant, presents significant challenges, hate speech, moderating complex
Comments: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images
Click to view abstract
Abstract:The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (this https URL).
349. 【2603.01449】Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines
Link: https://arxiv.org/abs/2603.01449
Authors: Xiangjian Hou,Chao Qin,Chang Ni,Xin Wang,Chun Yuan,Xiaodong Ma
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Global token mixing, MRI restoration, MRI, state-space sequence models, popular model design
Comments:
Click to view abstract
Abstract:Global token mixing, implemented via self-attention or state-space sequence models, has become a popular model design choice for MRI restoration. However, MRI restoration tasks differ substantially in how their degradations vary over image and k-space domains, and in the degree to which global coupling is already imposed by physics-driven data consistency terms. In this work, we ask the question whether global token mixing is actually beneficial in each individual task across three representative settings: accelerated MRI reconstruction with explicit data consistency, MRI super-resolution with k-space center cropping, and denoising of clinical carotid MRI data with spatially heteroscedastic noise. To reduce confounding factors, we establish a controlled testbed comparing a minimal local gated CNN and its large-field variant, benchmarking them directly against state-of-the-art global models under aligned training and evaluation protocols. For accelerated MRI reconstruction, the minimal unrolled gated-CNN baseline is already highly competitive compared to recent token-mixing approaches in public reconstruction benchmarks, suggesting limited additional benefits when the forward model and data-consistency steps provide strong global constraints. For super-resolution, where low-frequency k-space data are largely preserved by the controlled low-pass degradation, local gated models remain competitive, and a lightweight large-field variant yields only modest improvements. In contrast, for denoising with pronounced spatially heteroscedastic noise, token-mixing models achieve the strongest overall performance, consistent with the need to estimate spatially varying reliability. In conclusion, our results demonstrate that the utility of global token mixing in MRI restoration is task-dependent, and it should be tailored to the underlying imaging physics and degradation structure.
350. 【2603.00882】Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
Link: https://arxiv.org/abs/2603.00882
Authors: Zhangxing Bian,Shuwen Wei,Samuel W. Remedios,Junyu Chen,Aaron Carass,Blake E. Dewey,Jerry L. Prince
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: tissue motion non-invasively, internal tissue motion, tracking internal tissue, internal tissue, motion
Comments: Accepted at CVPR 2026
Click to view abstract
Abstract:Tagged MRI enables tracking internal tissue motion non-invasively. It encodes motion by modulating anatomy with periodic tags, which deform along with tissue. However, the entanglement between anatomy, tags and motion poses significant challenges for post-processing. The existence of tags and imaging blur hinders downstream tasks such as segmenting anatomy. Tag fading, due to T1-relaxation, disrupts the brightness constancy assumption for motion tracking. For decades, these challenges have been handled in isolation and sub-optimally. In contrast, we introduce a blind and nonlinear inverse framework for tagged MRI that, for the first time, unifies these tasks: anatomical image recovery, high-resolution cine image synthesis, and motion estimation. At its core, the synergy of MR physics and generative priors enables us to blindly estimate the unknown forward imaging models and high-resolution underlying anatomy, while simultaneously tracking 3D diffeomorphic Lagrangian motion over time. Experiments on tagged brain MRI demonstrate that our approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.
351. 【2603.00798】Efficient Conformal Volumetry for Template-Based Segmentation
Link: https://arxiv.org/abs/2603.00798
Authors: Matt Y. Cheung,Ashok Veeraraghavan,Guha Balakrishnan
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Keywords: propagates anatomical labels, compute volumetric biomarkers, propagates anatomical, downstream decision-making, widely used paradigm
Comments:
Click to view abstract
Abstract:Template-based segmentation, a widely used paradigm in medical imaging, propagates anatomical labels via deformable registration from a labeled atlas to a target image, and is often used to compute volumetric biomarkers for downstream decision-making. While conformal prediction (CP) provides finite-sample valid intervals for scalar metrics, existing segmentation-based uncertainty quantification (UQ) approaches either rely on learned model features, often unavailable in classic template-based pipelines, or treat the registration process as a black box, resulting in overly conservative intervals when applied directly in output space. We introduce ConVOLT, a CP framework that achieves efficient volumetric UQ by conditioning calibration on properties of the estimated deformation field from template-based segmentation. ConVOLT calibrates a learned volumetric scaling factor from deformation space features. We evaluate ConVOLT on template-based segmentation tasks involving global, regional, and label volumetry across multiple datasets and registration methods. ConVOLT achieves target coverage while producing substantially tighter intervals than output-space conformal baselines. Our work paves the way to exploit the registration process for efficient UQ in medical imaging pipelines.
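For readers unfamiliar with the output-space conformal baseline that the abstract compares against, here is a minimal sketch of standard split conformal prediction on a scalar volume. It is not ConVOLT itself, which conditions calibration on deformation-field features; the function name and the calibration volumes below are made up for illustration.

```python
import math

def conformal_radius(cal_predicted, cal_true, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] covers the true value with
    probability at least 1 - alpha, assuming calibration and test cases are
    exchangeable (standard split conformal prediction)."""
    scores = sorted(abs(t - p) for p, t in zip(cal_predicted, cal_true))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration cases for this alpha
    return scores[k - 1]

# Hypothetical calibration set: predicted vs. reference volumes (arbitrary units).
cal_predicted = [100, 120, 90, 110, 105, 95, 130, 115, 98]
cal_true = [103, 118, 95, 110, 101, 96, 124, 116, 99]

q = conformal_radius(cal_predicted, cal_true, alpha=0.25)
new_prediction = 108
interval = (new_prediction - q, new_prediction + q)
print(q, interval)  # 5 (103, 113)
```

The radius here is the same for every case, which is why such intervals tend to be conservative; the abstract's stated contribution is to tighten them by using properties of the estimated deformation field.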
352. 【2603.00233】Scaling Quantum Machine Learning without Tricks: High-Resolution and Diverse Image Generation
Link: https://arxiv.org/abs/2603.00233
Authors: Jonas Jäger,Florian J. Kiwit,Carlos A. Riofrío
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: rapidly evolving discipline, Quantum generative modeling, machine learning, generative modeling, rapidly evolving
Comments: 25 pages, 16 figures. Main text: 14 pages, 7 figures. Appendix: 11 pages, 9 figures
Click to view abstract
Abstract:Quantum generative modeling is a rapidly evolving discipline at the intersection of quantum computing and machine learning. Contemporary quantum machine learning is generally limited to toy examples or heavily restricted datasets with few elements. This is not only due to the current limitations of available quantum hardware but also due to the absence of inductive biases arising from application-agnostic designs. Current quantum solutions must resort to tricks to scale down high-resolution images, such as relying heavily on dimensionality reduction or utilizing multiple quantum models for low-resolution image patches. Building on recent developments in classical image loading to quantum computers, we circumvent these limitations and train quantum Wasserstein GANs on the established classical MNIST and Fashion-MNIST datasets. Using the complete datasets, our system generates full-resolution images across all ten classes and establishes a new state-of-the-art performance with a single end-to-end quantum generator without tricks. As a proof-of-principle, we also demonstrate that our approach can be extended to color images, exemplified on the Street View House Numbers dataset. We analyze how the choice of variational circuit architecture introduces inductive biases, which crucially unlock this performance. Furthermore, enhanced noise input techniques enable highly diverse image generation while maintaining quality. Finally, we show promising results even under quantum shot noise conditions.
353. 【2603.00218】GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features
Link: https://arxiv.org/abs/2603.00218
Authors: Yunzheng Zhu,Aichi Chien,Kimaya kulkarni,Luoting Zhuang,Stephen Park,Ricky Savjani,Daniel Low,William Hsu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical imaging, Deformable registration, crucial in medical, Deformable, probabilistic atlas generation
Comments:
Click to view abstract
Abstract:Deformable registration is crucial in medical imaging. Several existing applications include lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed VFM embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institution cohort (UCLA5DCT), and outperforms the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration errors, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm on corrField and 1.91 mm on DEEDS) and 1.11 mm on NLST nodule centers (compared to 1.11 mm on DEEDS). The substantiated performance on the nodule centers also demonstrates its robustness across challenging downstream tasks, such as nodule tracking, which is an essential prior step for early-stage lung cancer diagnosis.
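The Dice scores and relative improvements quoted in this abstract can be reproduced with a short sketch. The Dice function is the standard definition applied to flat binary masks; the toy masks and helper names are illustrative assumptions, while the DSC values fed to the second helper are the ones stated in the abstract.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A and B| / (|A| + |B|) for two
    flat binary masks of equal length."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 2 * intersection / total if total else 1.0

def relative_improvement_percent(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

# Toy masks: 3 foreground voxels each, 2 of them shared.
print(round(dice([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0]), 3))  # 0.667

# DSC pairs (GLIDE-Reg, DEEDS) as reported for Lung250M, NLST, UCLA5DCT.
for ours, deeds in [(0.859, 0.834), (0.862, 0.858), (0.901, 0.900)]:
    print(round(relative_improvement_percent(ours, deeds), 1))  # 3.0, 0.5, 0.1
```

The three printed percentages match the 3.0%, 0.5%, and 0.1% given in the abstract, confirming that they are relative rather than absolute differences.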
354. 【2603.00205】Efficient Flow Matching for Sparse-View CT Reconstruction
Link: https://arxiv.org/abs/2603.00205
Authors: Jiayang Shi,Lincen Yang,Zhong Li,Tristan Van Leeuwen,Daniel M. Pelt,K. Joost Batenburg
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Computed Tomography, ill-posed inverse problems, solving ill-posed inverse, potential for Computed, Ordinary Differential Equation
Comments:
Click to view abstract
Abstract:Generative models, particularly Diffusion Models (DM), have shown strong potential for Computed Tomography (CT) reconstruction serving as expressive priors for solving ill-posed inverse problems. However, diffusion-based reconstruction relies on Stochastic Differential Equations (SDEs) for forward diffusion and reverse denoising, where such stochasticity can interfere with repeated data consistency corrections in CT reconstruction. Since CT reconstruction is often time-critical in clinical and interventional scenarios, improving reconstruction efficiency is essential. In contrast, Flow Matching (FM) models sampling as a deterministic Ordinary Differential Equation (ODE), yielding smooth trajectories without stochastic noise injection. This deterministic formulation is naturally compatible with repeated data consistency operations. Furthermore, we observe that FM-predicted velocity fields exhibit strong correlations across adjacent steps. Motivated by this, we propose an FM-based CT reconstruction framework (FMCT) and an efficient variant (EFMCT) that reuses previously predicted velocity fields over consecutive steps to substantially reduce the number of Neural network Function Evaluations (NFEs), thereby improving inference efficiency. We provide theoretical analysis showing that the error introduced by velocity reuse is bounded when combined with data consistency operations. Extensive experiments demonstrate that FMCT/EFMCT achieve competitive reconstruction quality while significantly improving computational efficiency compared with diffusion-based methods. The codebase is open-sourced at this https URL.
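The velocity-reuse idea can be illustrated with a toy Euler sampler. This is a rough sketch under strong assumptions, not the EFMCT algorithm: the analytic `toy_velocity` stands in for a trained network, the state is a single scalar, the data-consistency step is omitted, and the `reuse` parameter is a hypothetical name.

```python
def euler_sample(velocity, x0, n_steps, reuse=1):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    The velocity is re-evaluated only every `reuse` steps and the cached
    value is applied in between. Returns (x_final, number_of_evaluations).
    """
    x, dt, nfe, v = x0, 1.0 / n_steps, 0, None
    for step in range(n_steps):
        if step % reuse == 0:
            v = velocity(x, step * dt)  # one function evaluation (NFE)
            nfe += 1
        x = x + dt * v
    return x, nfe

# Toy stand-in for a learned velocity field: straight-line transport to 3.0.
TARGET = 3.0

def toy_velocity(x, t):
    return (TARGET - x) / (1.0 - t)

x_full, nfe_full = euler_sample(toy_velocity, x0=-1.0, n_steps=10, reuse=1)
x_fast, nfe_fast = euler_sample(toy_velocity, x0=-1.0, n_steps=10, reuse=2)
print(nfe_full, nfe_fast)  # 10 5
```

In this idealised case the velocity is constant along the trajectory, so reusing it halves the evaluations without changing the result. For a real network the velocities are only strongly correlated, and the abstract states that the resulting error is bounded when combined with data consistency operations.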
355. 【2603.00204】Optimisation of SOUP-GAN and CSR-GAN for High Resolution MR Images Reconstruction
Link: https://arxiv.org/abs/2603.00204
Authors: Muneeba Rashid,Hina Shakir,Humaira Mehwish,Asarim Amir,Reema Qaiser Khan
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Magnetic Resonance, Generative Adversarial Networks, efficient Generative Adversarial, modern medicine, limited by equipment
Comments:
Click to view abstract
Abstract:Magnetic Resonance (MR) imaging is a diagnostic tool used in modern medicine; however, its output can be affected by motion artefacts and may be limited by equipment. This research focuses on MRI image quality enhancement using two efficient Generative Adversarial Networks (GANs) models: SOUP-GAN and CSR-GAN. In both models, meaningful architectural modifications were introduced. The generator and discriminator of each were further deepened by adding convolutional layers and were enhanced in filter sizes as well. The LeakyReLU activation function was used to improve gradient flow, and hyperparameter tuning strategies were applied, including a reduced learning rate and an optimal batch size. Moreover, spectral normalisation was proposed to address mode collapse and improve training stability. The experiment shows that CSR-GAN has better performance in reconstructing the image with higher frequency details and reducing noise compared to other methods, with an optimised PSNR of 34.6 and SSIM of 0.89. However, SOUP-GAN performed the best in terms of delivering less noisy images with good structures, achieving a PSNR of 34.4 and SSIM of 0.83. The obtained results indicate that the proposed enhanced GAN model can be a useful tool for MR image quality improvement for subsequent better disease diagnostics.
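As a reminder of what the PSNR figures measure, here is a minimal sketch of the standard definition on flat pixel lists. The abstract gives neither the unit nor the intensity range, so the decibel scale and `max_value=1.0` are assumptions, and the pixel values are invented.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel lists.

    `max_value` is the largest possible pixel intensity (1.0 for images
    normalised to [0, 1], 255 for 8-bit images).
    """
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Two of four pixels are off by 0.1, so the mean squared error is 0.005.
value = psnr([0.0, 0.5, 1.0, 0.5], [0.1, 0.5, 0.9, 0.5])
print(round(value, 2))  # 23.01
```

Because the scale is logarithmic, a gap such as 34.6 versus 34.4 corresponds to only a few percent difference in mean squared error.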
356. 【2603.00162】GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans
Link: https://arxiv.org/abs/2603.00162
Authors: Joy T Wu,Daniel Beckmann,Sarah Miller,Alexander Lee,Elizabeth Theng,Stephan Altmayer,Ken Chang,David Kersting,Tomoaki Otani,Brittany Z Dashevsky,Hye Lim Park,Matteo Novello,Kip Guja,Curtis Langlotz,Ismini Lourentzou,Daniel Gruhl,Benjamin Risse,Guido A Davidzon
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: efficient diagnostic aids, treatment response assessment, reader shortages necessitate, expert reader shortages, cornerstone imaging modality
Comments:
Click to view abstract
Abstract:[18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (DICE score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, which include and are not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
357. 【2603.00115】Multimodal Modular Chain of Thoughts in Energy Performance Certificate Assessment
Link: https://arxiv.org/abs/2603.00115
Authors: Zhen Peng,Peter J. Bentley
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Energy Performance Certificate, building energy performance, energy performance remains, scalable Energy Performance, performance remains challenging
Comments:
Click to view abstract
Abstract:Accurate evaluation of building energy performance remains challenging in regions where scalable Energy Performance Certificate (EPC) assessments are unavailable. This paper presents a cost-efficient framework that leverages Vision-Language models for automated EPC pre-assessment from limited visual information. The proposed Multimodal Modular Chain of Thoughts (MMCoT) architecture decomposes EPC estimation into intermediate reasoning stages and explicitly propagates inferred attributes across tasks using structured prompting. Experiments on a multimodal dataset of 81 residential properties in the United Kingdom show that MMCoT achieves statistically significant improvements over instruction-only prompting for EPC estimation. Analysis based on accuracy, recall, mean absolute error, and confusion matrices indicate that the proposed approach captures the ordinal structure of EPC ratings, with most errors occurring between adjacent classes. These results suggest that modular prompt-based reasoning offers a promising direction for low-cost EPC pre-assessment in data-scarce settings.



