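This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
645 papers were updated today, including:
- Natural language processing: 80
- Information retrieval: 7
- Computer vision: 151
Natural Language Processing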
1. 【2603.12252】EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Link: https://arxiv.org/abs/2603.12252
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Comments: 23 pages, 18 figures
Abstract:Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
2. 【2603.12249】SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
Link: https://arxiv.org/abs/2603.12249
Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Constructing scientific multimodal, trade-off among scale, Constructing scientific, involves an inherent, inherent trade-off
Abstract:Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
3. 【2603.12246】Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Link: https://arxiv.org/abs/2603.12246
Authors: Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: inference-time scaling, directly checked, reasoning judges, Reasoning, benefit from inference-time
Abstract:Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.
4. 【2603.12226】Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration
Link: https://arxiv.org/abs/2603.12226
Authors: Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: single-domain academic silos, work remains confined, academic silos, leading to larger, larger and longer-term
Comments: Code and dataset provided at [this https URL](https://github.com/pkargupta/idea_catalyst)
Abstract:Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
5. 【2603.12206】CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
Link: https://arxiv.org/abs/2603.12206
Authors: Alexandre Le Mercier, Thomas Demeester, Chris Develder
Categories: Computation and Language (cs.CL)
Keywords: achieving linear complexity, Hidden State Poisoning, gained significant traction, alternatives to Transformers, State Poisoning Attacks
Comments: 22 pages, 6 figures
Abstract:State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at this https URL.
6. 【2603.12201】IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Link: https://arxiv.org/abs/2603.12201
Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Long-context agentic workflows, making attention efficiency, attention efficiency critical, Long-context agentic, serving cost
Abstract:Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
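The Full/Shared layer partition described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, and "nearest Full layer" is assumed here to mean the nearest preceding Full layer, since a Shared layer can only reuse indices that have already been computed.

```python
from typing import Dict, List


def assign_index_sources(num_layers: int, full_layers: List[int]) -> Dict[int, int]:
    """Map each layer to the Full layer whose top-k indices it would use.

    Assumption (not stated in the abstract): a Shared layer reuses the
    indices of the nearest *preceding* Full layer, so layer 0 must be Full.
    """
    full = set(full_layers)
    if 0 not in full:
        raise ValueError("layer 0 must be a Full layer in this sketch")
    if any(layer < 0 or layer >= num_layers for layer in full):
        raise ValueError("Full layer index out of range")

    source: Dict[int, int] = {}
    current = 0
    for layer in range(num_layers):
        if layer in full:
            current = layer  # this layer runs its own indexer
        source[layer] = current  # Shared layers point back to `current`
    return source


def indexer_savings(num_layers: int, full_layers: List[int]) -> float:
    """Fraction of per-layer indexer runs that are skipped."""
    return 1 - len(set(full_layers)) / num_layers


if __name__ == "__main__":
    # Hypothetical 8-layer model keeping indexers only at layers 0 and 4.
    print(assign_index_sources(8, [0, 4]))
    print(indexer_savings(8, [0, 4]))
```

With two Full layers out of eight, the sketch skips 75% of indexer runs, which matches the proportion quoted in the abstract; which layers to keep is what the paper's greedy search or distillation would decide.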
7. 【2603.12191】Long-Context Encoder Models for Polish Language Understanding
Link: https://arxiv.org/abs/2603.12191
Authors: Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta
Categories: Computation and Language (cs.CL)
Keywords: decoder-only Large Language, Large Language Models, Large Language, encoder-only architectures remain, NLP landscape
Abstract:While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
8. 【2603.12180】Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Link: https://arxiv.org/abs/2603.12180
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Multimodal agents offer, complex document-intensive workflows, automating complex document-intensive, Multimodal agents, document-intensive workflows
Abstract:Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
9. 【2603.12165】QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
Link: https://arxiv.org/abs/2603.12165
Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang
Categories: Computation and Language (cs.CL)
Keywords: code generation models, introduces significant noise, training code generation, Synthetic data, code generation
Comments: 12 pages, 5 figures. Under review at ACL 2026
Abstract:Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard a model generates an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
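The abstract does not give the RMI formula, so the following is only a rough sketch of the idea under a stated assumption: RMI is taken to be the average per-token gain in the query's log-likelihood when the answer is given, i.e. log P(Q|A) minus log P(Q). The helper names and the quantile cut-offs are hypothetical, and the token log-probabilities are assumed to come from some language model that is not shown.

```python
from typing import List, Sequence


def rmi(logp_q_given_a: Sequence[float], logp_q: Sequence[float]) -> float:
    """Average per-token gain in query log-likelihood from seeing the answer.

    Assumed definition for illustration: mean of log P(q_t | A, q_<t)
    minus log P(q_t | q_<t) over the query tokens.
    """
    if len(logp_q_given_a) != len(logp_q) or len(logp_q) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    gains = [cond - uncond for cond, uncond in zip(logp_q_given_a, logp_q)]
    return sum(gains) / len(gains)


def select_mid_band(scores: List[float],
                    low_frac: float = 0.25,
                    high_frac: float = 0.25) -> List[int]:
    """Keep sample indices after dropping the lowest and highest RMI tails.

    Both tails are dropped because the abstract flags both extremes as
    quality issues; the fractions are placeholder values.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lo = int(len(order) * low_frac)
    hi = len(order) - int(len(order) * high_frac)
    return sorted(order[lo:hi])


if __name__ == "__main__":
    print(rmi([-1.0, -2.0], [-3.0, -4.0]))
    print(select_mid_band([0.1, 5.0, 1.0, 2.0]))
```

The paper's stratified selection and its strong/weak model disagreement step are not reproduced here; the mid-band filter only shows why a two-sided cut differs from a simple top-k by score.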
10. 【2603.12152】LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Link: https://arxiv.org/abs/2603.12152
Authors: Feiyu Duan, Xuanjing Huang, Zhongyu Wei
Categories: Computation and Language (cs.CL)
Keywords: large language models, rapid advancement, advancement of large, large language, accelerated progress
Abstract:The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments for coherent life trajectories generation, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.
11. 【2603.12149】Linking Perception, Confidence and Accuracy in MLLMs
Link: https://arxiv.org/abs/2603.12149
Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Recent advances
Comments: Accepted by CVPR2026
Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
12. 【2603.12133】TopoBench: Benchmarking LLMs on Hard Topological Reasoning
Link: https://arxiv.org/abs/2603.12133
Authors: Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O'Connor, Fergal Reid
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Solving topological grid, powerful large language, Solving topological, large language models, global spatial invariants
Comments: Accepted, Workshop on Logical Reasoning of Large Language Models at ICLR 2026
Abstract:Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain of thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns like premature commitment and constraint forgetting have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at this http URL.
13. 【2603.12123】Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
Link: https://arxiv.org/abs/2603.12123
Authors: Tae-Eun Song
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, language models struggle, struggle to catch, review
Comments: 10 pages, 2 figures, 8 tables
Abstract:Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.
14. 【2603.12117】SommBench: Assessing Sommelier Expertise of Language Models
Link: https://arxiv.org/abs/2603.12117
Authors: William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta, Julie Dupouy, Gianluca Barmina, Andrea Blasi Núñez, Peter Schneider-Kamp, Kristian Košťál, Michal Ries, Lukas Galke Poech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: multicultural capabilities, Theory Question Answering, rapid advances, advances of large, increasingly important
Abstract:With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at this https URL.
15. 【2603.12105】To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times
Link: https://arxiv.org/abs/2603.12105
Authors: Thomas Hikaru Clark, Carlos Arriaga, Javier Conde, Gonzalo Martínez, Pedro Reviriego
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, multiword expressions, recently been shown
Abstract:Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.
16. 【2603.12094】Human-Centred LLM Privacy Audits: Findings and Frictions
Link: https://arxiv.org/abs/2603.12094
Authors: Dimitri Staufer, Kirsten Morehouse, David Hartmann, Bettina Berendt
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Large language models, massive training corpora, Large language, learn statistical associations, learn statistical
Abstract:Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{total}{=}458$), GPT-4o predicts 11 of 50 features for everyday people with $\ge$60\% accuracy, and participants report wanting control over LLM-generated associations despite not considering all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also contribute to exposing a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, what model--individual associations even include is under-specified and operationalisation relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable human-centred LLM privacy audits, we identify nine frictions that emerged in our study and offer recommendations for future work and the design of human-centred LLM privacy audits.
17. 【2603.12056】XSkill: Continual Learning from Experience and Skills in Multimodal Agents
Link: https://arxiv.org/abs/2603.12056
Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. (May) Fung
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: tackle complex reasoning, complex reasoning tasks, open-ended settings, tackle complex, suffer from inefficient
Abstract:Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
18. 【2603.12050】Translationese as a Rational Response to Translation Task Difficulty
Link: https://arxiv.org/abs/2603.12050
Authors: Maria Kunilovskaya
Categories: Computation and Language (cs.CL)
Keywords: phenomenon widely referred, Translations systematically diverge, texts originally produced, translation task, translation task difficulty
Comments: 17 pages, submitted to ARR March 2026
Abstract:Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.
19. 【2603.12021】Just Use XML: Revisiting Joint Translation and Label Projection
Link: https://arxiv.org/abs/2603.12021
Authors: Thennal D K,Chris Biemann,Hans Ole Hatzel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: extending span-annotated datasets, Label projection, extending span-annotated, span-annotated datasets, Label
Comments:
Abstract:Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.
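The XML-tag idea described in the abstract above can be illustrated with a small sketch. This is not LabelPigeon's code: the helper names `tag_spans` and `extract_spans`, the `PER`/`LOC` labels, and the German sentence standing in for machine-translation output are all hypothetical, and nested or overlapping spans are not handled.

```python
import re

def tag_spans(text, spans):
    """Wrap each (start, end, label) character span in an XML-style tag."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Matches a flat (non-nested) tagged span such as <PER>...</PER>.
TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_spans(tagged):
    """Strip the tags and return (plain_text, [(start, end, label), ...])."""
    plain, spans, cursor, offset = [], [], 0, 0
    for m in TAG.finditer(tagged):
        before = tagged[cursor:m.start()]
        plain.append(before)
        offset += len(before)
        inner = m.group(2)
        spans.append((offset, offset + len(inner), m.group(1)))
        plain.append(inner)
        offset += len(inner)
        cursor = m.end()
    plain.append(tagged[cursor:])
    return "".join(plain), spans

source = "Angela Merkel visited Paris."
tagged_source = tag_spans(source, [(0, 13, "PER"), (22, 27, "LOC")])

# Hypothetical stand-in for what a tag-preserving translation model would return.
tagged_translation = "<PER>Angela Merkel</PER> besuchte <LOC>Paris</LOC>."
target_text, target_spans = extract_spans(tagged_translation)
```

Because the tags travel through translation together with the text, the projected spans are read directly off the output and no separate alignment step is needed in this toy setting.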
20. 【2603.11991】BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
Link: https://arxiv.org/abs/2603.11991
Authors: Ilias Aarab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: human-readable label descriptions, eliminating costly task-specific, costly task-specific annotation, matching texts directly, offers the promise
Comments: Accepted at ICLR 2026. 31 pages, 5 figures, 9 tables. Code: [this https URL](https://github.com/IliasAarab/btzsc) ; Dataset: [this https URL](https://huggingface.co/datasets/btzsc/btzsc) ; Leaderboard: [this https URL](https://huggingface.co/spaces/btzsc/btzsc-leaderboard) . Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026), 2026
Abstract:Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
21. 【2603.11957】CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
Link: https://arxiv.org/abs/2603.11957
Authors: Pranav Raikote,Korbinian Randl,Ioanna Miliou,Athanasios Lakes,Panagiotis Papapetrou
Categories: Computation and Language (cs.CL)
Keywords: Scaling educational assessment, language models requires, large language models, educational assessment, assessment with large
Comments:
Abstract:Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK = 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
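To make the combination of temperature scaling and confidence-based routing in the abstract above concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the logits, the temperature of 2.0, and the 0.9 threshold are all assumed for illustration. In practice the temperature would be fitted on held-out data, typically by minimizing negative log-likelihood.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens overconfident outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, temperature, threshold):
    """Return ('auto' or 'human', predicted grade index, calibrated confidence)."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    grade = probs.index(confidence)
    decision = "auto" if confidence >= threshold else "human"
    return decision, grade, confidence

# Hypothetical logits over three grade levels for one student answer.
logits = [4.0, 1.0, 0.0]
uncalibrated = route(logits, temperature=1.0, threshold=0.9)
calibrated = route(logits, temperature=2.0, threshold=0.9)
```

With these assumed numbers the uncalibrated model clears the threshold and would be graded automatically, while the calibrated confidence falls below it and the answer is routed to a human grader. The predicted grade is the same in both cases, since temperature scaling does not change the argmax.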
22. 【2603.11955】PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
Link: https://arxiv.org/abs/2603.11955
Authors: Minjia Wang,Yunfeng Wang,Xiao Ma,Dexin Lv,Qifan Guo,Lynn Zheng,Benliang Wang,Lei Wang,Jiannan Li,Yongwei Xing,David Xu,Zheng Sun
Categories: Computation and Language (cs.CL)
Keywords: developing personalized applications, training machine learning, machine learning models, records of individuals', studying behavior
Comments: EACL 2026 Industry Track
Abstract:Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
23. 【2603.11947】Resurfacing Paralinguistic Awareness in Large Audio Language Models
Link: https://arxiv.org/abs/2603.11947
Authors: Hao Yang,Minghan Wang,Tongtong Wu,Lizhen Qu,Ehsan Shareghi,Gholamreza Haffari
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Keywords: Large Audio Language, Audio Language Models, Large Audio, Language Models, great interactive potential
Comments: Submitted to Interspeech 2026
Abstract:Large Audio Language Models (LALMs) have expanded the interaction with human to speech modality, which introduces great interactive potential, due to the paralinguistic cues implicitly indicating the user context. However, building on the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface the paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses to jointly identify paralinguistic layers and semantic understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol accordingly to equip LALMs with paralinguistic-aware capabilities, including (1) selective-layer fine-tuning, and (2) an auxiliary dual-level classification head. Our experiments demonstrate that PE-FT protocol efficiently and effectively resurfaces the paralinguistic awareness, even surpassing the performance of the all-layer fine-tuning strategy.
24. 【2603.11924】Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding
Link: https://arxiv.org/abs/2603.11924
Authors: Xinyu Li,Zhen Zhang,Qi Chen,Anton van den Hengel,Lina Yao,Javen Qinfeng Shi
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Existing chemical understanding, tasks primarily rely, static molecular representations, Existing chemical, understanding tasks primarily
Comments: 18 pages
Abstract:Existing chemical understanding tasks primarily rely on static molecular representations, limiting their ability to model inherently dynamic phenomena such as bond breaking or conformational changes, which are essential for a chemist to understand chemical reactions. To address this gap, we introduce Chemical Dynamics Understanding (ChemDU), a new task that translates 4D molecular trajectories into interpretable natural-language explanations. ChemDU focuses on fundamental dynamic scenarios, including gas-phase and catalytic reactions, and requires models to reason about key events along molecular trajectories, such as bond formation and dissociation, and to generate coherent, mechanistically grounded narratives. To benchmark this capability, we construct Chem4DBench, the first dataset pairing 4D molecular trajectories with expert-authored explanations across these settings. We further propose Chem4DLLM, a unified model that integrates an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics. We hope that ChemDU, together with Chem4DBench and Chem4DLLM, will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.
25. 【2603.11915】CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
Link: https://arxiv.org/abs/2603.11915
Authors: Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan
Categories: Computation and Language (cs.CL)
Keywords: human social intelligence, Large Language Models, Theory of Mind, Mind Booklet Task, ability to reason
Comments:
Abstract:Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
26. 【2603.11896】Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
Link: https://arxiv.org/abs/2603.11896
Authors: Lu Wang(1),Zhuoran Jin(1),Yupu Hao(1),Yubo Chen(1),Kang Liu(1),Yulong Ao(2),Jun Zhao(1) ((1) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, (2) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Multimodal large language, large language models, offline video understanding, Multimodal large, continuously arriving video
Comments:
Abstract:Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: this https URL
27. 【2603.11881】Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language
Link: https://arxiv.org/abs/2603.11881
Authors: Remigiusz Kinas,Paweł Kiszczak,Sergio P. Perez,Krzysztof Ociepa,Łukasz Flis,Krzysztof Wróbel,Adrian Gwoździej
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: optimized for European, NVIDIA Minitron approach, NVIDIA Model Optimizer, specifically optimized, European languages
Comments:
Abstract:This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
28. 【2603.11838】DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining
Link: https://arxiv.org/abs/2603.11838
Authors: Yutong Yan,Raphael Tang,Zhenyu Gao,Wenxi Jiang,Yao Lu
Categories: Computation and Language (cs.CL); General Finance (q-fin.GN)
Keywords: risk introducing lookahead, introducing lookahead bias, internet-scale data risk, data risk introducing, large language models
Comments:
Abstract:In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
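The annual-cutoff partitioning mentioned in the abstract above can be pictured with a short sketch. The abstract does not say how the corpora are assembled, so the cumulative rule used here (a model with cutoff year Y sees every document dated on or before 31 December of Y) is an assumption, as are the document schema and the helper names `corpus_for_cutoff` and `build_partitions`.

```python
from datetime import date

# Twelve annual cutoffs, 2013 through 2024, matching the range in the abstract.
CUTOFF_YEARS = range(2013, 2025)

def corpus_for_cutoff(docs, cutoff_year):
    """Keep only documents dated on or before the end of cutoff_year (assumed rule)."""
    return [d for d in docs if d["date"].year <= cutoff_year]

def build_partitions(docs):
    """Map each cutoff year to the documents a model with that cutoff may train on."""
    return {year: corpus_for_cutoff(docs, year) for year in CUTOFF_YEARS}

# Hypothetical toy corpus with one timestamp per document.
docs = [
    {"text": "a", "date": date(2012, 5, 1)},
    {"text": "b", "date": date(2013, 12, 31)},
    {"text": "c", "date": date(2014, 1, 1)},
]
partitions = build_partitions(docs)
```

Under this rule no document dated after a model's cutoff can enter its training set, which is the property that guards against lookahead bias in backtesting.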
29. 【2603.11781】From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
Link: https://arxiv.org/abs/2603.11781
Authors: Sunil Prakash
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Keywords: Multi-agent LLM systems, Multi-agent LLM, LLM systems increasingly, increasingly tackle complex, interaction patterns remain
Comments: 26 pages, 6 tables, 2 figures, 2 listings
Abstract:Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task-dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single-agent tokens, and single-agent generation outperforms DCI on overall quality. DCI's contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.
30. 【2603.11780】Large Language Models for Biomedical Article Classification
Link: https://arxiv.org/abs/2603.11780
Authors: Jakub Proboszcz,Paweł Cichosz
Categories: Computation and Language (cs.CL)
Keywords: biomedical article classification, large language models, presents a systematic, systematic and in-depth, in-depth investigation
Comments: 63 pages, 25 tables, 4 figures
Abstract:This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The obtained average PR AUC over 15 challenging datasets above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning) and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.
31. 【2603.11778】Trust Oriented Explainable AI for Fake News Detection
Link: https://arxiv.org/abs/2603.11778
Authors: Krzysztof Siwek,Daniel Stankowski,Maciej Stodolski
Categories: Computation and Language (cs.CL)
Keywords: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, application of Explainable, compares selected interpretability
Comments: 9 pages, 4 figures, 2 tables
Abstract:This article examines the application of Explainable Artificial Intelligence (XAI) in NLP based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.
32. 【2603.11772】Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents
Link: https://arxiv.org/abs/2603.11772
Authors: Yaocong Li,Qiang Lan,Leihan Zhang,Le Zhang
Categories: Computation and Language (cs.CL)
Keywords: Chinese legal RAG, Retrieval-Augmented Generation, mainstream RAG systems, lack specialized support, Chinese legal
Comments: 20 pages, 4 figures, to be submitted to a conference/journal
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances two core contributions: First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at this https URL.
33. 【2603.11770】An Automatic Text Classification Method Based on Hierarchical Taxonomies, Neural Networks and Document Embedding: The NETHIC Tool
Link: https://arxiv.org/abs/2603.11770
Authors: Luigi Lomasto,Rosario Di Florio,Andrea Ciapetti,Giuseppe Miscione,Giulia Ruggiero,Daniele Toti
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: classification method implemented, automatic text classification, text classification method, software tool called, highly-scalable neural networks
Comments: ICEIS 2019 Conference
Abstract:This work describes an automatic text classification method implemented in a software tool called NETHIC, which takes advantage of the inner capabilities of highly-scalable neural networks combined with the expressiveness of hierarchical taxonomies. As such, NETHIC succeeds in bringing about a mechanism for text classification that proves to be significantly effective as well as efficient. The tool had undergone an experimentation process against both a generic and a domain-specific corpus, outputting promising results. On the basis of this experimentation, NETHIC has been now further refined and extended by adding a document embedding mechanism, which has shown improvements in terms of performance on the individual networks and on the whole hierarchical model.
34. 【2603.11749】Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
Link: https://arxiv.org/abs/2603.11749
Authors: Konstantin Krestnikov
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: trained on mixed-quality, prefer correct statements, Consistency Principle, language models, correct statements
Comments: v1: initial release. Full code, synthetic datasets and experiments available at [this https URL](https://github.com/Rai220/compression-drives-truth) This work was done independently
Abstract:Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression--Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M--86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at this https URL.
35. 【2603.11743】Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
Link: https://arxiv.org/abs/2603.11743
Authors: Assaf Siani,Anna Kernerman,Ilan Kernerman
Categories: Computation and Language (cs.CL)
Keywords: evaluate generated outputs, plays a crucial, crucial role, role in machine, serves to evaluate
Comments:
Abstract:Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributed balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under resourced language pairs, including morphology-rich languages.
36. 【2603.11698】OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Link: https://arxiv.org/abs/2603.11698
Authors: Xianjing Han,Bin Zhu,Shiqi Hu,Franklin Mingzhe Li,Patrick Carrington,Roger Zimmermann,Jingjing Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: made rapid progress, producing visually high-quality, made rapid, rapid progress, progress in producing
Comments: Project page: [this https URL](https://hanxjing.github.io/OSCBench)
Abstract:Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
37. 【2603.11687】SemBench: A Universal Semantic Framework for LLM Evaluation
Link: https://arxiv.org/abs/2603.11687
Authors: Mikel Zubillaga,Naiara Perez,Oscar Sainz,German Rigau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Natural Language Processing, Large Language Models, exhibit remarkable generative, Recent progress, progress in Natural
Comments: Accepted at LREC 2026
Abstract:Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
38. 【2603.11686】In the LLM era, Word Sense Induction remains unsolved
Link: https://arxiv.org/abs/2603.11686
Authors: Anna Mosolova, Marie Candito, Carlos Ramisch
Categories: Computation and Language (cs.CL)
Keywords: word sense induction, word sense disambiguation, word sense, sense induction, sense disambiguation
Comments: Accepted at ACL 2025 (Findings)
Abstract:In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and the number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.
39. 【2603.11677】From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration
Link: https://arxiv.org/abs/2603.11677
Authors: Gaole He, Brian Y. Lim
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, power autonomous agents, multi-step tasks
Comments: CHI 2026 Workshop on Human-Agent Collaboration
Abstract:Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks. However, human-agent interaction remains pointwise and reactive: users approve or correct individual actions to mitigate immediate risks, without visibility into subsequent consequences. This forces users to mentally simulate long-term effects, a cognitively demanding and often inaccurate process. Users have control over individual steps but lack the foresight to make informed decisions. We argue that effective collaboration requires foresight, not just control. We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions. Simulation transforms intervention from reactive guesswork into informed exploration, while helping users discover latent constraints and preferences along the way. This perspective paper characterizes the limitations of current paradigms, introduces a conceptual framework for simulation-based collaboration, and illustrates its potential through concrete human-agent collaboration scenarios.
40. 【2603.11667】A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy
Link: https://arxiv.org/abs/2603.11667
Authors: María Isabel Rivas Ginel, Janiça Hackenbuchner, Alina Secară, Ralph Krüger, Caroline Rossi
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: today increasingly automated, increasingly automated language, negotiated in today, today increasingly, increasingly automated
Comments: Under review
Abstract:This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.
41. 【2603.11665】Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Link: https://arxiv.org/abs/2603.11665
Authors: Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang
Categories: Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judge models due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.
42. 【2603.11650】QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Link: https://arxiv.org/abs/2603.11650
Authors: Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu
Categories: Computation and Language (cs.CL)
Keywords: effectiveness upper bound, retrieval-augmented generation, effectiveness upper, upper bound, bound of retrieval-augmented
Comments:
Abstract:The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains exhibit that QChunker effectively resolves aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.
43. 【2603.11611】Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
Link: https://arxiv.org/abs/2603.11611
Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Rotary Positional Embedding, relative positional information, Positional Embedding, encoding relative positional, positional information
Comments:
Abstract:Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which becomes especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model size, sequence lengths and datasets of varying quality and architectures, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) showcase unstable learning trajectories, which can be alleviated through minimal RoPE application or QK-Norm which converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.
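To make the idea of partial RoPE concrete, here is a minimal pure-Python sketch that rotates only a leading fraction of a vector's dimensions and passes the rest through unchanged. It is an illustration, not the authors' code: the function name `partial_rope`, the `rotary_fraction` parameter, the adjacent-pair rotation convention, and the base of 10000 are all assumptions made for this example.

```python
import math

def partial_rope(vec, position, rotary_fraction=0.25, base=10000.0):
    """Apply rotary position encoding to the first `rotary_fraction` of dims.

    Hypothetical helper for illustration: adjacent dimensions (0,1), (2,3), ...
    are rotated as 2-D pairs; the remaining dimensions are left untouched.
    """
    dim = len(vec)
    rot_dim = int(dim * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so keep an even count
    out = list(vec)
    for i in range(0, rot_dim, 2):
        # Frequency decays geometrically across the rotated pairs.
        theta = position * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The query-key score depends only on the relative offset (2 in both cases).
near = dot(partial_rope(q, 5, 0.5), partial_rope(k, 3, 0.5))
far = dot(partial_rope(q, 12, 0.5), partial_rope(k, 10, 0.5))
```

In this sketch only `rot_dim` dimensions need cached cosine and sine values, which is where a smaller rotary fraction would reduce the size of a position cache.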
44. 【2603.11597】Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese
Link: https://arxiv.org/abs/2603.11597
Authors: Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai, Yasuto Fujimoto, Yoshiyuki Takahara, Atsushi Ohara, Hirohiko Miyake, Genichiro Ishii
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Japanese remains unexplored, Japanese pathology reports, Japanese pathology report, supporting pathology report, pathology report writing
Comments: 9 pages (including bibliography), 2 figures, 6 tables
Abstract:The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.
45. 【2603.11583】UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
Link: https://arxiv.org/abs/2603.11583
Authors: Ofir Marom
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Model, task depends heavily, Large Language, depends heavily, natural language
Comments:
Abstract:The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
46. 【2603.11578】Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
Link: https://arxiv.org/abs/2603.11578
Authors: Roman Koshkin, Jeon Haesung, Lianbo Liu, Hao Shi, Mengjie Zhao, Yusuke Fujita, Yui Sudo
Categories: Computation and Language (cs.CL)
Keywords: offline machine translation, Simultaneous machine translation, machine translation models, machine translation, translation models coupled
Comments: 16 pages, 6 figures
Abstract:Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
47. 【2603.11564】Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
Link: https://arxiv.org/abs/2603.11564
Authors: Zhenxu Tian, Yi Su, Juntao Li, Min Zhang
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, efficient Large Language, Language Models, Large Language, efficient Large
Comments:
Abstract:The Key-Value (KV) cache is crucial for efficient Large Language Models (LLMs) inference, but excessively long contexts drastically increase KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., up to nearly lossless performance 99.5% on NIAH with 3% KV cache budgets).
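The general pattern described above, scoring cached tokens by how much attention a set of stand-in queries pays to them and evicting the rest, can be sketched as follows. This is a generic illustration under stated assumptions, not DapQ itself: the abstract does not say how the position-aware pseudo queries are built, so the sketch simply takes them as input, and the names `attention_weights` and `select_kv_to_keep` are hypothetical.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_kv_to_keep(pseudo_queries, keys, budget):
    """Return sorted indices of the `budget` keys with the highest total attention.

    `pseudo_queries` stands in for the decoding-stage queries that are not
    available at prefill time; how to construct them is left open here.
    """
    importance = [0.0] * len(keys)
    for query in pseudo_queries:
        for i, weight in enumerate(attention_weights(query, keys)):
            importance[i] += weight
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
kept = select_kv_to_keep([[5.0, 0.0]], keys, budget=2)
```

With the single pseudo query pointing along the first axis, the two keys most aligned with it (indices 0 and 3) are the ones retained.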
48. 【2603.11545】One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries
Link: https://arxiv.org/abs/2603.11545
Authors: Mayank Saini, Arit Kumar Bishwas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: coordinates specialized tools, document modalities, autonomous multimodal query, multimodal query processing, present an agentic
Comments: 19 pages, 3 figures
Abstract:We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.
49. 【2603.11535】Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
Link: https://arxiv.org/abs/2603.11535
Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: limiting dynamic computation, dynamic computation allocation, requiring auxiliary losses, fixed number, Token-choice
Comments:
Abstract:Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.
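The routing rule in this abstract (a token goes to every expert whose threshold its score exceeds, with thresholds tracked by an EMA) can be sketched in a few lines. This is a rough illustration rather than the paper's method: the abstract does not specify how the threshold target is estimated, so the sketch assumes it is the score cutoff that would admit a fixed fraction of a batch, and the class name, `keep_fraction`, and `decay` are hypothetical.

```python
class ExpertThresholdRouter:
    """Toy threshold router: each expert keeps its own EMA score threshold."""

    def __init__(self, num_experts, keep_fraction=0.25, decay=0.9):
        self.thresholds = [0.0] * num_experts
        self.keep_fraction = keep_fraction  # assumed per-expert load target
        self.decay = decay

    def route(self, token_scores):
        """Pick experts for one token using only that token's own scores."""
        return [e for e, s in enumerate(token_scores) if s > self.thresholds[e]]

    def update(self, batch_scores):
        """Move each threshold toward the cutoff admitting `keep_fraction` of tokens.

        `batch_scores` is a list of per-token score lists (one score per expert).
        """
        for e in range(len(self.thresholds)):
            column = sorted((row[e] for row in batch_scores), reverse=True)
            k = max(1, int(len(column) * self.keep_fraction))
            cutoff = column[k - 1]  # k-th largest score for this expert
            self.thresholds[e] = (
                self.decay * self.thresholds[e] + (1 - self.decay) * cutoff
            )

router = ExpertThresholdRouter(num_experts=2, keep_fraction=0.5, decay=0.5)
router.update([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4], [0.8, 0.2]])
```

Because `route` looks only at one token's scores and the stored thresholds, the number of experts per token can vary (zero, one, or several) without consulting other tokens in the batch.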
50. 【2603.11513】Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
Link: https://arxiv.org/abs/2603.11513
Authors: Sanchit Pandey (BITS Pilani, Hyderabad, India)
Categories: Computation and Language (cs.CL)
Keywords: improve factual accuracy, utilize retrieved information, effectively utilize retrieved, Retrieval, augmented generation RAG
Comments: 10 pages, 5 figures, planning to submit to ARR March 2026. Code and evaluation data: [this https URL](https://anonymous.4open.science/r/rag-utilization-study-C67F) . Earlier draft preprint available on Zenodo: [this https URL](https://zenodo.org/records/18870116) (note: this arXiv submission is an updated draft)
Abstract:Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models (7B parameters or fewer) can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
51. 【2603.11510】Tiny Aya: Bridging Scale and Multilingual Depth
Link: https://arxiv.org/abs/2603.11510
Authors: Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, Phil Blunsom, Nick Frosst, Joelle Pineau, Beyza Ermis, Ahmet Üstün, Julia Kreutzer, Marzieh Fadaee
Categories: Computation and Language (cs.CL)
Keywords: Tiny Aya redefines, Tiny Aya, small multilingual language, Aya redefines, strong multilingual understanding
Comments:
Abstract:Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.
52. 【2603.11504】LongFlow: Efficient KV Cache Compression for Reasoning Models
Link: https://arxiv.org/abs/2603.11504
Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: complex tasks including, tasks including mathematical, shown strong performance, Recent reasoning models, including mathematical reasoning
Comments:
Abstract:Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.
53. 【2603.11495】Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs
Link: https://arxiv.org/abs/2603.11495
Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, empowers Large Language, Tool-calling empowers Large, Language Models, Large Language
Comments: 17 pages, 8 figures
Abstract:Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.
54. 【2603.11482】AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style
Link: https://arxiv.org/abs/2603.11482
Authors: Joonyong Park, Jerry Li
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: objective metric exists, costly subjective judgments, standardized objective metric, voices currently relies, relies on costly
Comments:
Abstract:Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.
55. 【2603.11446】LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction
Link: https://arxiv.org/abs/2603.11446
Authors: Yuzhi Liang, Lixiang Ma, Xinrong Zhu
Categories: Computation and Language (cs.CL)
Keywords: Pre-trained Language Models, based on Pre-trained, Pre-trained Language, Large Language Model, Mainstream methods
Comments:
Abstract:Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.
56. 【2603.11415】BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion
Link: https://arxiv.org/abs/2603.11415
Authors: Varun Iyer, Cornelia Caragea
Categories: Computation and Language (cs.CL)
Keywords: Abstractive summarization requires, Abstractive summarization, summarization requires models, source document, Bigram Lookahead Promotion
Comments: LREC 2026
Abstract:Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at this https URL
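As a rough picture of what a hash-table decoding intervention of this kind might look like, the sketch below indexes the source document's bigrams and, at each decoding step, raises the logits of tokens that would complete a source bigram with the previously generated token. It is not the released BLooP code: the additive `bonus`, the function names, and the use of whole words as tokens are assumptions made for illustration.

```python
from collections import defaultdict

def build_bigram_table(source_tokens):
    """Map each source token to the set of tokens that follow it in the source."""
    table = defaultdict(set)
    for first, second in zip(source_tokens, source_tokens[1:]):
        table[first].add(second)
    return table

def promote_source_bigrams(logits, prev_token, table, bonus=2.0):
    """Return a copy of `logits` with source-bigram continuations boosted.

    `logits` maps candidate tokens to scores; the additive `bonus` is a
    hypothetical promotion rule chosen for this sketch.
    """
    boosted = dict(logits)
    for token in table.get(prev_token, ()):
        if token in boosted:
            boosted[token] += bonus
    return boosted

table = build_bigram_table("the cat sat on the mat".split())
step_logits = {"cat": 0.0, "dog": 0.5, "mat": -1.0}
boosted = promote_source_bigrams(step_logits, "the", table, bonus=1.0)
best = max(boosted, key=boosted.get)
```

Here "dog" has the highest raw logit, but after promotion "cat" wins because "the cat" occurs in the source; the lookup costs one dictionary access per step and leaves the model itself untouched.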
57. 【2603.11414】MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models
Link: https://arxiv.org/abs/2603.11414
Authors: Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
Keywords: solve university-level materials, require accurate interpretation, multimodal large language, large language models, benchmark dataset designed
Comments: 27 pages, 4 tables, 6 figures
Abstract:We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
58. 【2603.11412】Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
Link: https://arxiv.org/abs/2603.11412
Authors: Amani Maina-Kilaas, Roger Levy
Categories: Computation and Language (cs.CL)
Keywords: linguistic representations affect, representations affect processing, affect processing difficulty, linguistic representations, surprisal theory
Comments: 10 pages, 4 figures
Abstract:Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.
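As a rough illustration of why resampling with a finite particle set could produce digging-in, the toy simulation below tracks two structural hypotheses through an ambiguous region. It is a sketch under strong assumptions that are not taken from the paper: every word in the ambiguous region is equally likely under both hypotheses, particles are resampled after every word by plain multinomial resampling, and the function name and all parameters are hypothetical.

```python
import random


def loss_probability(n_particles, region_length, prior_correct=0.5,
                     trials=2000, seed=0):
    """Estimate how often the eventually-correct parse has no surviving
    particle by the end of an ambiguous region (toy model, see assumptions)."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(trials):
        # Draw the initial particle set from the prior over the two parses.
        k = sum(rng.random() < prior_correct for _ in range(n_particles))
        for _ in range(region_length):
            if k == 0 or k == n_particles:
                break  # one hypothesis has already died out
            # Ambiguous words carry no evidence, so weights stay uniform and
            # resampling just redraws particles from the current proportions.
            p = k / n_particles
            k = sum(rng.random() < p for _ in range(n_particles))
        if k == 0:
            lost += 1
    return lost / trials


if __name__ == "__main__":
    for n in (5, 50):
        for length in (1, 5, 20):
            print(n, length, loss_probability(n, length))
```

Under these assumptions, the chance of having discarded the eventually correct parse grows with the length of the ambiguous region and shrinks as the particle count increases, which matches the direction of the effect described in the abstract.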
59. 【2603.11409】Speak or Stay Silent: Context-Aware Turn-Taking in Multi-Party Dialogue
Link: https://arxiv.org/abs/2603.11409
Authors: Kratika Bhagtani, Mrinal Anand, Yu Chen Xu, Amit Kumar Singh Yadav
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Existing voice, Existing, treat every detected, Abstract, detected pause
Comments: Submitted for review to Interspeech 2026
Abstract:Existing voice AI assistants treat every detected pause as an invitation to speak. This works in dyadic dialogue, but in multi-party settings, where an AI assistant participates alongside multiple speakers, pauses are abundant and ambiguous. An assistant that speaks on every pause becomes disruptive rather than useful. In this work, we formulate context-aware turn-taking: at every detected pause, given the full conversation context, our method decides whether the assistant should speak or stay silent. We introduce a benchmark of over 120K labeled conversations spanning three multi-party corpora. Evaluating eight recent large language models, we find that they consistently fail at context-aware turn-taking under zero-shot prompting. We then propose a supervised fine-tuning approach with reasoning traces, improving balanced accuracy by up to 23 percentage points. Our findings suggest that context-aware turn-taking is not an emergent capability; it must be explicitly trained.
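The abstract reports results in balanced accuracy without defining it. The sketch below assumes the standard definition (the mean of per-class recall) and uses invented labels to show why this metric suits a setting where most pauses call for silence: an assistant that speaks on every pause scores only 0.5. The function name and the data are illustrative, not from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = speak, 0 = stay silent)."""
    recalls = []
    for cls in (0, 1):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        if not indices:
            continue  # skip a class that never occurs in the labels
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical data: 8 pauses where silence is right, 2 where speaking is right.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_speak = [1] * len(labels)
print(balanced_accuracy(labels, always_speak))  # 0.5
```

Plain accuracy for the same always-speak policy would be 0.2 on this data, so the two metrics can diverge sharply when pauses are abundant but rarely invite a response.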
60. 【2603.11394】Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
Link: https://arxiv.org/abs/2603.11394
Authors: Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Patients and clinicians, healthcare inquiries, large language models, clinicians are increasingly, increasingly using chatbots
Comments:
Abstract:Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.
61. 【2603.11342】Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation
Link: https://arxiv.org/abs/2603.11342
Authors: Aria Nourbakhsh, Salima Lamsiyah, Adelaide Danilov, Christoph Schommer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: neural network models, area of research, output of neural, neural network, active area
Comments: 37 pages, 11 figures
Abstract:The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
62. 【2603.11327】Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Link: https://arxiv.org/abs/2603.11327
Authors: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: meta reinforcement learning, in-context meta reinforcement, reinforcement learning, formulation for agentic, meta reinforcement
Comments: 23 pages, Preprint
Abstract:This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test-time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test-time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at this https URL.
63. 【2603.11321】Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Link: https://arxiv.org/abs/2603.11321
Authors: Yuning Wu, Ke Wang, Devin Chen, Kai Wei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Verifiable Rewards, pure Reinforcement Learning, Reinforcement Learning, Group Relative Policy, post-training reasoning models
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
64. 【2603.11295】Temporal Text Classification with Large Language Models
Link: https://arxiv.org/abs/2603.11295
Authors: Nishat Raihan, Marcos Zampieri
Categories: Computation and Language (cs.CL)
Keywords: Temporal Text Classification, Large Language Models, Large Language, Languages change, Abstract
Comments:
Abstract:Languages change over time. Computational models can be trained to recognize such changes enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot and few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.
65. 【2603.11281】ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
Link: https://arxiv.org/abs/2603.11281
Authors: Monica Munnangi, Saiph Savage
Categories: Computation and Language (cs.CL)
Keywords: question-answering benchmarks predominantly, real patient consultations, Medical question-answering benchmarks, clarification-seeking nature, benchmarks predominantly evaluate
Comments:
Abstract:Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
66. 【2603.11254】Artificial Intelligence for Sentiment Analysis of Persian Poetry
Link: https://arxiv.org/abs/2603.11254
Authors: Arash Zargar, Abolfazl Moshiri, Mitra Shafaei, Shabnam Rahimi-Golkhandan, Mohamad Tavakoli-Targhi, Farzad Khalvati
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Artificial Intelligence, creating textual data, Recent advancements, large language models, language models
Comments:
Abstract:Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analysing, and creating textual data. These language models open a significant opportunity in analyzing literature and, more specifically, poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models in grasping the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings in this study indicate that the GPT4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments compared to Parvin E'tesami's poems. In addition, comparing the utilization of poetic meters highlighted the superiority of Rumi's poems in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in conducting computer-based semantic studies, where human interpretations are not required, thereby significantly reducing potential biases in the analysis.
67. 【2603.11253】LLMs Can Infer Political Alignment from Online Conversations
Link: https://arxiv.org/abs/2603.11253
Authors: Byunghwee Lee, Sangyeon Kim, Filippo Menczer, Yong-Yeol Ahn, Haewoon Kwak, Jisun An
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: seemingly innocuous preferences, reveal private traits, seemingly innocuous, specific slang, private traits
Comments: 55 pages; 4 figures in the main text and 18 supplementary figures, 11 supplementary tables
Abstract:Due to the correlational structure in our traits such as identities, cultures, and political attitudes, seemingly innocuous preferences, such as following a band or using a specific slang, can reveal private traits. This possibility, especially when combined with massive, public social data and advanced computational methods, poses a fundamental privacy risk. Given that our increasing data exposure online and the rapid advancement of AI are increasing the misuse potential of such risk, it is critical to understand the capacity of large language models (LLMs) to exploit it. Here, using online discussions on this http URL and Reddit, we show that LLMs can reliably infer hidden political alignment, significantly outperforming traditional machine learning models. Prediction accuracy further improves as we aggregate multiple text-level inferences into a user-level prediction, and as we use more politics-adjacent domains. We demonstrate that LLMs leverage words that can be highly predictive of political alignment while not being explicitly political. Our findings underscore the capacity and risks of LLMs for exploiting socio-cultural correlates.
68. 【2603.11228】Markovian Generation Chains in Large Language Models
Link: https://arxiv.org/abs/2603.11228
Authors: Mingmeng Geng, Amr Mohamed, Guokan Shang, Michalis Vazirgiannis, Thierry Poibeau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: large language models, language models, raises an important, important question, large language
Comments:
Abstract:The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that the iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
69. 【2603.11223】MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries
Link: https://arxiv.org/abs/2603.11223
Authors: Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: lose important contextual, important contextual nuance, requires composing answers, Retrieval-Augmented Generation, Knowledge Graphs
Comments: Our code is available at [this https URL](https://github.com/DataSciencePolimi/MDER-DR_RAG)
Abstract:Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at this https URL.
70. 【2603.11220】Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
Link: https://arxiv.org/abs/2603.11220
Authors: Qingtao Pan, Zhihao Dou, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models, adapt varying computational, varying computational budgets
Comments:
Abstract:Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, making it possible to elastically adjust the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.
71. 【2603.11193】DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
Link: https://arxiv.org/abs/2603.11193
Authors: Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
Categories: Computation and Language (cs.CL)
Keywords: Verifiable Rewards, Reinforcement learning, learning with Verifiable, large language models, eliciting reasoning capabilities
Comments: 13 pages, 6 figures
Abstract:Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
72. 【2603.11168】Huntington Disease Automatic Speech Recognition with Biomarker Supervision
Link: https://arxiv.org/abs/2603.11168
Authors: Charles L. Wang, Cady Chen, Ziwei Gong, Julia Hirschberg
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Automatic speech recognition, articulatory distortion challenge, distortion challenge current, Huntington disease, speech remains underexplored
Comments:
Abstract:Automatic speech recognition (ASR) for pathological speech remains underexplored, especially for Huntington's disease (HD), where irregular timing, unstable phonation, and articulatory distortion challenge current models. We present a systematic HD-ASR study using a high-fidelity clinical speech corpus not previously used for end-to-end ASR training. We compare multiple ASR families under a unified evaluation, analyzing WER as well as substitution, deletion, and insertion patterns. HD speech induces architecture-specific error regimes, with Parakeet-TDT outperforming encoder-decoder and CTC baselines. HD-specific adaptation reduces WER from 6.99% to 4.95% and we also propose a method for using biomarker-based auxiliary supervision and analyze how error behavior is reshaped in severity-dependent ways rather than uniformly improving WER. We open-source all code and models.
73. 【2603.11137】Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
Link: https://arxiv.org/abs/2603.11137
Authors: Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: On-policy distillation, Relaxed On-Policy Distillation, transferring reasoning capabilities, capacity-constrained models, negative transfer
Comments: Code will be available soon
Abstract:On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD temperately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. Specifically, REOPOLD outperforms recent RL approaches, achieving 6.7~12x greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.
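To make the "log-likelihood ratio as a token reward" reading concrete, the sketch below computes a per-token reward as the teacher log-probability minus the student log-probability for tokens the student sampled. The function name and the symmetric `clip` bound are hypothetical; in particular, the simple clipping here is only a stand-in and is not the mixture-based reward clipping that the abstract attributes to REOPOLD.

```python
import math


def token_rewards(teacher_logprobs, student_logprobs, clip=None):
    """Per-token reward r_t = log p_teacher(y_t) - log p_student(y_t).

    `clip` is a hypothetical symmetric bound used only for illustration.
    """
    rewards = []
    for lp_teacher, lp_student in zip(teacher_logprobs, student_logprobs):
        reward = lp_teacher - lp_student
        if clip is not None:
            reward = max(-clip, min(clip, reward))
        rewards.append(reward)
    return rewards


# Invented probabilities for two student-sampled tokens.
teacher = [math.log(0.9), math.log(0.01)]
student = [math.log(0.6), math.log(0.5)]
print(token_rewards(teacher, student))            # second token is heavily penalized
print(token_rewards(teacher, student, clip=1.0))  # penalty bounded at -1.0
```

A token the teacher finds much less likely than the student does receives a large negative reward, which hints at why unbounded rewards could destabilize training and why some form of relaxation might help.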
74. 【2603.11126】Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion
Link: https://arxiv.org/abs/2603.11126
Authors: Yuanhong Wu, Djallel Bouneffouf, D. Frank Hsu
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Keywords: Aligning large language, large language models, Aligning large, language models, safe deployment
Comments: 5 pages, 3 figures, accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Abstract:Aligning large language models (LLMs) with human values is a central challenge for ensuring trustworthy and safe deployment. While existing methods such as Reinforcement Learning from Human Feedback (RLHF) and its variants have improved alignment, they often rely on a single evaluator or narrowly defined reward signals, limiting their ability to capture ethical pluralism. In this work, we propose the Value Alignment System using Combinatorial Fusion Analysis (VAS-CFA), a framework that operationalizes multi-agent fusion alignment. It instantiates multiple moral agents, each fine-tuned to represent a distinct normative perspective, and fuses their outputs using CFA with both rank- and score-based aggregation. This design leverages cognitive diversity between agents to mitigate conflicts and redundancies across multiple agents, producing responses that better reflect human values. Empirical evaluation demonstrates that VAS-CFA outperforms both single-agent baselines and prior aggregation approaches on standard metrics, showing that multi-agent fusion provides a robust and effective mechanism for advancing value alignment in LLMs.
75. 【2603.11123】Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming Automatic Speech Recognition
Link: https://arxiv.org/abs/2603.11123
Authors: Yinfeng Xia, Jian Tang, Junfeng Hou, Gaopeng Xu, Haitao Yao
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, Automatic Speech Recognition, Automatic Speech
Comments: Submitted to Interspeech 2026
Abstract:Although the deep integration of the Automatic Speech Recognition (ASR) system with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified framework based on LLMs that integrates both non-streaming and streaming speech recognition capabilities. We propose a joint training paradigm that enables the system to seamlessly transition between two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which can enhance streaming recognition accuracy without introducing additional latency. The experimental results demonstrate that Uni-ASR not only achieves competitive performance within non-streaming mode, but also demonstrates strong effectiveness in streaming scenarios under diverse latency constraints.
76. 【2603.11078】CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
Link: https://arxiv.org/abs/2603.11078
Authors: Kristen Pereira, Neelabh Sinha, Rajat Ghosh, Debojyoti Dutta
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: code review agents, Recent advances, code review, enabled code review, review agents
Comments:
Abstract:Recent advances in frontier large language models have enabled code review agents that operate in open-ended, reasoning-intensive settings. However, the lack of standardized benchmarks and granular evaluation protocols makes it difficult to assess behavior of code review agents beyond coarse success metrics, particularly for tasks where false positives are costly. To address this gap, we introduce CR-Bench, a benchmarking dataset, and CR-Evaluator, a fine-grained evaluation pipeline for code review agents. Using these tools, we conduct a preliminary study evaluating both a single-shot agent and a Reflexion-based agent across two frontier models. We find that code review agents can exhibit a low signal-to-noise ratio when designed to identify all hidden issues, obscuring true progress and developer productivity when measured solely by resolution rates. Our analysis identifies the hidden trade-off between issue resolution and spurious findings, revealing a frontier that constrains effective agent design. Together, CR-Bench and CR-Evaluator provide a timely foundation for studying and developing code review agents as LLM-based systems transition from controlled benchmarks to real-world software engineering workflows.
77. 【2603.11067】Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation
Link: https://arxiv.org/abs/2603.11067
Authors: Jingtao Wang, Yucong Wang, Jun Ding, Rui Cai, Xun Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: achieve remarkable performance, require costly training, Large language models, achieve remarkable, remarkable performance
Comments:
Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
78. 【2603.11053】Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
Link: https://arxiv.org/abs/2603.11053
Authors: Amirhossein Bozorgkhoo, Igor Molybog
Categories: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Keywords: multiple language models, Speculative decoding, accelerate inference, multiple language, language models
Comments:
Abstract: Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.
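As background for the throughput question raised in this abstract, here is a minimal sketch of a commonly cited back-of-the-envelope analysis of speculative decoding. It is not the SDSL theory itself, which the abstract does not spell out. The sketch assumes each drafted token is accepted independently with probability `alpha`; under that assumption, drafting `gamma` tokens per step yields `(1 - alpha**(gamma + 1)) / (1 - alpha)` tokens per target-model pass on average. The parameter `cost_ratio` (draft-model time divided by target-model time per forward pass) and all function names are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Assumes (hypothetically) i.i.d. acceptance of drafted tokens with rate alpha.
    """
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target model.
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def walltime_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One step costs gamma draft passes (each cost_ratio target-pass units)
    plus one target pass that verifies the drafted tokens.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the estimated speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: walltime_speedup(alpha, g, cost_ratio))
```

For instance, `best_gamma(0.8, 0.05)` picks the draft length with the highest estimated speedup under these simplified assumptions; a real system would need measured acceptance rates and timings.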
79. 【2603.11051】OpenSanctions Pairs: Large-Scale Entity Matching with LLMs
Link: https://arxiv.org/abs/2603.11051
Authors: Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: real-world international sanctions, international sanctions aggregation, release OpenSanctions Pairs, large-scale entity matching, analyst deduplication
Comments:
Abstract: We release OpenSanctions Pairs, a large-scale entity matching benchmark derived from real-world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross-script names, noisy and missing attributes, and set-valued fields typical of compliance workflows. We benchmark a production rule-based matcher (nomenklatura RegressionV1 algorithm) against open- and closed-source LLMs in zero- and few-shot settings. Off-the-shelf LLMs substantially outperform the production rule-based baseline (91.33% F1), reaching up to 98.95% F1 (GPT-4o) and 98.23% F1 with a locally deployable open model (DeepSeek-R1-Distill-Qwen-14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in-context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule-based system over-matches (high false positives), whereas LLMs primarily fail on cross-script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty-aware review. Code available at this https URL
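To make the F1 figures and the "over-matching" failure mode concrete, the sketch below computes precision, recall, and F1 from pairwise match counts. The counts in the example are invented for illustration and are not taken from the paper; the function name is likewise an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for pairwise match decisions.

    tp: pairs correctly labeled as matches
    fp: non-matching pairs wrongly labeled as matches (over-matching)
    fn: true matches that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: an over-matching system trades precision for recall.
over_matching = precision_recall_f1(tp=950, fp=150, fn=50)
conservative = precision_recall_f1(tp=940, fp=10, fn=60)
```

In this made-up example the over-matching system has slightly higher recall but clearly lower precision and F1, which is the pattern the abstract attributes to the rule-based baseline.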
80. 【2603.11408】Beyond Polarity: Multi-Dimensional LLM Sentiment Signals for WTI Crude Oil Futures Return Prediction
Link: https://arxiv.org/abs/2603.11408
Authors: Dehao Dai, Ding Ma, Dou Liu, Kerui Geng, Yiqing Wang
Categories: Statistical Finance (q-fin.ST); Computation and Language (cs.CL)
Keywords: prices remains challenging, oil prices remains, crude oil prices, WTI crude oil, traditional polarity-based sentiment
Comments: 28 pages, 4 figures, 4 tables
Abstract:Forecasting crude oil prices remains challenging because market-relevant information is embedded in large volumes of unstructured news and is not fully captured by traditional polarity-based sentiment measures. This paper examines whether multi-dimensional sentiment signals extracted by large language models improve the prediction of weekly WTI crude oil futures returns. Using energy-sector news articles from 2020 to 2025, we construct five sentiment dimensions covering relevance, polarity, intensity, uncertainty, and forwardness based on GPT-4o, Llama 3.2-3b, and two benchmark models, FinBERT and AlphaVantage. We aggregate article-level signals to the weekly level and evaluate their predictive performance in a classification framework. The best results are achieved by combining GPT-4o and FinBERT, suggesting that LLM-based and conventional financial sentiment models provide complementary predictive information. SHAP analysis further shows that intensity- and uncertainty-related features are among the most important predictors, indicating that the predictive value of news sentiment extends beyond simple polarity. Overall, the results suggest that multi-dimensional LLM-based sentiment measures can improve commodity return forecasting and support energy-market risk monitoring.
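The step of aggregating article-level signals to the weekly level could look roughly like the sketch below. The abstract does not say which aggregation function is used; a plain mean per ISO week is assumed here, and the function name and sample scores are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_mean(article_scores):
    """Average article-level scores per ISO week.

    article_scores: iterable of (publication_date, score) pairs.
    Returns a dict keyed by (iso_year, iso_week). Using the mean is an
    assumption; the paper may weight or aggregate differently.
    """
    buckets = defaultdict(list)
    for day, score in article_scores:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(score)
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


# Invented uncertainty scores for three articles.
sample = [
    (date(2024, 1, 1), 0.2),
    (date(2024, 1, 3), 0.4),
    (date(2024, 1, 8), -0.1),
]
weekly = weekly_mean(sample)
```

The same grouping would be repeated for each of the five sentiment dimensions before the weekly features are passed to a classifier.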
Information Retrieval
1. 【2603.11796】Enhancing Music Recommendation with User Mood Input
Link: https://arxiv.org/abs/2603.11796
Authors: Terence Zeng
Categories: Information Retrieval (cs.IR)
Keywords: music streaming platforms, modern music streaming, streaming platforms, essential in modern, vast amount
Comments: 28 pages, 9 figures, 2 tables
Abstract:Recommendation systems have become essential in modern music streaming platforms, due to the vast amount of content available. A common approach in recommendation systems is collaborative filtering, which suggests content to users based on the preferences of others with similar patterns. However, this method performs poorly in domains where interactions are sparse, such as music. Content-based filtering is an alternative approach that examines the qualities of the items themselves. Prior work has explored a range of content-filtering techniques for music, including genre classification, instrument detection, and lyrics analysis. In the literature review component of this work, we examine these methods in detail. Music emotion recognition is a type of content-based filtering that is less explored but has significant potential. Since a user's emotional state influences their musical choices, incorporating user mood into recommendation systems is an alternative way to personalize the listening experience. In this study, we explore a mood-assisted recommendation system that suggests songs based on the desired mood using the energy-valence spectrum. Single-blind experiments are conducted, in which participants are presented with two recommendations (one generated from a mood-assisted recommendation system and one from a baseline system) and are asked to rate them. Results show that integrating user mood leads to a statistically significant improvement in recommendation quality, highlighting the potential of such approaches.
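One simple way to picture mood-assisted recommendation on the energy-valence spectrum is a nearest-neighbor lookup: each song gets a (valence, energy) point, each mood maps to a target point, and songs are ranked by distance to the target. This is only an illustrative sketch; the mood coordinates, song data, and function name below are all invented, and the study's actual system may work differently.

```python
import math

# Hypothetical mood targets as (valence, energy), both in [0, 1].
MOOD_TARGETS = {
    "happy": (0.9, 0.8),
    "calm": (0.7, 0.2),
    "sad": (0.2, 0.2),
    "angry": (0.2, 0.9),
}

# Invented catalog entries for illustration.
SONGS = [
    {"title": "A", "valence": 0.90, "energy": 0.85},
    {"title": "B", "valence": 0.65, "energy": 0.25},
    {"title": "C", "valence": 0.15, "energy": 0.30},
]


def recommend(songs, mood, k=2):
    """Return the k songs closest to the mood's target point (Euclidean distance)."""
    target = MOOD_TARGETS[mood]
    ranked = sorted(
        songs,
        key=lambda s: math.dist((s["valence"], s["energy"]), target),
    )
    return ranked[:k]
```

Calling `recommend(SONGS, "calm", k=1)` returns the song whose valence and energy sit nearest the assumed "calm" point; a production system would likely blend this distance with collaborative or content-based scores.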
2. 【2603.11759】Modeling Trial-and-Error Navigation With a Sequential Decision Model of Information Scent
Link: https://arxiv.org/abs/2603.11759
Authors: Xiaofu Jin, Yunpeng Bai, Antti Oulasvirta
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: nested in hierarchies, struggle to locate, ambiguous or deeply, deeply nested, Users
Comments:
Abstract:Users often struggle to locate an item within an information architecture, particularly when links are ambiguous or deeply nested in hierarchies. Information scent has been used to explain why users select incorrect links, but this concept assumes that users see all available links before deciding. In practice, users frequently select a link too quickly, overlook relevant cues, and then rely on backtracking when errors occur. We extend the concept of information scent by framing navigation as a sequential decision-making problem under memory constraints. Specifically, we assume that users do not scan entire pages but instead inspect strategically, looking "just enough" to find the target given their time budget. To choose which item to inspect next, they consider both local (this page) and global (site) scent; however, both are constrained by memory. Trying to avoid wasting time, they occasionally choose the wrong links without inspecting everything on a page. Comparisons with empirical data show that our model replicates key navigation behaviors: premature selections, wrong turns, and recovery from backtracking. We conclude that trial-and-error behavior is well explained by information scent when accounting for the sequential and bounded characteristics of the navigation problem.
3. 【2603.11610】Federated Learning and Unlearning for Recommendation with Personalized Data Sharing
Link: https://arxiv.org/abs/2603.11610
Authors: Liang Qu, Jianxin Li, Wei Yuan, Shangfei Zheng, Lu Chen, Chengfei Liu, Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Keywords: Federated recommender systems, coordinating model training, recommender systems, protecting user privacy, data
Comments: 14 pages
Abstract:Federated recommender systems (FedRS) have emerged as a paradigm for protecting user privacy by keeping interaction data on local devices while coordinating model training through a central server. However, most existing federated recommender systems adopt a one-size-fits-all assumption on user privacy, where all users are required to keep their data strictly local. This setting overlooks users who are willing to share their data with the server in exchange for better recommendation performance. Although several recent studies have explored personalized user data sharing in FedRS, they assume static user privacy preferences and cannot handle user requests to remove previously shared data and its corresponding influence on the trained model. To address this limitation, we propose FedShare, a federated learn-unlearn framework for recommender systems with personalized user data sharing. FedShare not only allows users to control how much interaction data is shared with the server, but also supports data unsharing requests by removing the influence of the unshared data from the trained model. Specifically, FedShare leverages shared data to construct a server-side high-order user-item graph and uses contrastive learning to jointly align local and global representations. In the unlearning phase, we design a contrastive unlearning mechanism that selectively removes representations induced by the unshared data using a small number of historical embedding snapshots, avoiding the need to store large amounts of historical gradient information as required by existing federated recommendation unlearning methods. Extensive experiments on three public datasets demonstrate that FedShare achieves strong recommendation performance in both the learning and unlearning phases, while significantly reducing storage overhead in the unlearning phase compared with state-of-the-art baselines.
4. 【2603.11486】Quantized Inference for OneRec-V2
Link: https://arxiv.org/abs/2603.11486
Authors: Yi Su, Xinchen Luo, Hongtao Cheng, Ziteng Shu, Yunfeng Zhao, Fangyu Zhang, Jiaqiang Liu, Xiao Liang, Yiwu Liu, Ruiming Tang
Categories: Information Retrieval (cs.IR)
Keywords: preserving model quality, substantial system-level benefits, demonstrated substantial system-level, Quantized inference, demonstrated substantial
Comments:
Abstract: Quantized inference has demonstrated substantial system-level benefits in large language models while preserving model quality. In contrast, reliably applying low-precision quantization to recommender systems remains challenging in industrial settings. This difficulty arises from differences in training paradigms, architectural patterns, and computational characteristics, which lead to distinct numerical behaviors in weights and activations. Traditional recommender models often exhibit high-magnitude and high-variance weights and activations, making them more sensitive to quantization-induced perturbations. In addition, recommendation workloads frequently suffer from limited hardware utilization, limiting the practical gains of low-precision computation. In this work, we revisit low-precision inference in the context of generative recommendation. Through empirical distribution analysis, we show that the weight and activation statistics of OneRec-V2 are significantly more controlled and closer to those of large language models than traditional recommendation models. Moreover, OneRec-V2 exhibits a more compute-intensive inference pattern with substantially higher hardware utilization, enabling more end-to-end throughput gains with low-precision computation. Leveraging this property, we develop an FP8 post-training quantization framework and integrate it into an optimized inference infrastructure. The proposed joint optimization achieves a 49% reduction in end-to-end inference latency and a 92% increase in throughput. Extensive online A/B testing further confirms that FP8 inference introduces no degradation in core metrics. These results suggest that as recommender systems evolve toward the paradigms of large language models, algorithm-level and system-level optimization techniques established in the LLM domain can be effectively adapted to large-scale recommendation workloads.
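The claim that high-magnitude, high-variance tensors are more sensitive to quantization can be illustrated with a toy experiment. The sketch below uses a uniform 8-bit grid with a single per-tensor scale as a stand-in; real FP8 formats are floating-point with non-uniform spacing, and this is not the paper's framework. All names and the synthetic data are assumptions.

```python
import random


def quantize_dequantize(values, levels=127):
    """Round values onto a symmetric uniform grid with one per-tensor scale."""
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)
    return [round(v / scale) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the toy experiment is repeatable
well_behaved = [rng.gauss(0.0, 1.0) for _ in range(1000)]
with_outlier = well_behaved + [100.0]  # one high-magnitude value

err_plain = mse(well_behaved, quantize_dequantize(well_behaved))
err_outlier = mse(with_outlier, quantize_dequantize(with_outlier))
```

A single outlier stretches the per-tensor scale, so the remaining values land on a much coarser grid and `err_outlier` comes out far larger than `err_plain`. Tensors with tightly controlled statistics, as the abstract reports for OneRec-V2, avoid this effect.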
5. 【2603.11407】Reproducible Synthetic Clinical Letters for Seizure Frequency Information Extraction
Link: https://arxiv.org/abs/2603.11407
Authors: Yujian Gan, Stephen H. Barlow, Ben Holgate, Joe Davies, James T. Teo, Joel S. Winston, Mark P. Richardson
Categories: Information Retrieval (cs.IR)
Keywords: variable free-text clinic, annotate and share, recorded in variable, variable free-text, hard to annotate
Comments:
Abstract:Seizure-frequency information is important for epilepsy research and clinical care, but it is usually recorded in variable free-text clinic letters that are hard to annotate and share. We developed a reproducible, privacy-preserving framework for extracting seizure frequency using fully synthetic yet task-faithful epilepsy letters. We defined a structured label scheme covering common descriptions of seizure burden, including explicit rates, ranges, clusters, seizure-free intervals, unknown frequency, and explicit no-seizure statements. A teacher language model generated NHS-style synthetic letters paired with normalized labels, rationales, and evidence spans. We fine-tuned several open-weight language models (4B-14B parameters) on these synthetic letters to extract seizure frequency from full documents, comparing direct numeric prediction with structured label prediction and testing evidence-grounded outputs. On a clinician-checked held-out set of real clinic letters, models trained only on synthetic data generalized well, and structured labels consistently outperformed direct numeric regression. With 15,000 synthetic training letters, models achieved micro-F1 scores up to 0.788 for fine-grained categories and 0.847 for pragmatic categories; a medically oriented 4B model achieved 0.787 and 0.858, respectively. Evidence-grounded outputs also supported rapid clinical verification and error analysis. These results show that synthetic, structured, evidence-grounded supervision can enable robust seizure-frequency extraction without sharing sensitive patient text and may generalize to other temporally complex clinical information extraction tasks.
6. 【2603.11223】MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries
Link: https://arxiv.org/abs/2603.11223
Authors: Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: lose important contextual, important contextual nuance, requires composing answers, Retrieval-Augmented Generation, Knowledge Graphs
Comments: Our code is available at [this https URL](https://github.com/DataSciencePolimi/MDER-DR_RAG)
Abstract:Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at this https URL.
7. 【2603.11051】OpenSanctions Pairs: Large-Scale Entity Matching with LLMs
Link: https://arxiv.org/abs/2603.11051
Authors: Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: real-world international sanctions, international sanctions aggregation, release OpenSanctions Pairs, large-scale entity matching, analyst deduplication
Comments:
Abstract: We release OpenSanctions Pairs, a large-scale entity matching benchmark derived from real-world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross-script names, noisy and missing attributes, and set-valued fields typical of compliance workflows. We benchmark a production rule-based matcher (nomenklatura RegressionV1 algorithm) against open- and closed-source LLMs in zero- and few-shot settings. Off-the-shelf LLMs substantially outperform the production rule-based baseline (91.33% F1), reaching up to 98.95% F1 (GPT-4o) and 98.23% F1 with a locally deployable open model (DeepSeek-R1-Distill-Qwen-14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in-context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule-based system over-matches (high false positives), whereas LLMs primarily fail on cross-script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty-aware review. Code available at this https URL
Computer Vision
1. 【2603.12267】EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Link: https://arxiv.org/abs/2603.12267
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generative models rely, discrete token sequences, video generative models, generative models
Comments: Accepted by CVPR 2026. Project page: [this https URL](https://silentview.github.io/EVATok/)
Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
2. 【2603.12266】MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Link: https://arxiv.org/abs/2603.12266
Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Comments: Project Page: [this https URL](https://accio-lab.github.io/MM-CondChain)
Abstract:Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
3. 【2603.12265】OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Link: https://arxiv.org/abs/2603.12265
Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: real-time streaming environments, Modern visual agents, Modern visual, physically structured, structured to operate
Comments: Technical Report. Project Page: [this https URL](https://go2heart.github.io/omnistream/)
Abstract:Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
4. 【2603.12264】GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
Link: https://arxiv.org/abs/2603.12264
Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: target joint understanding, offering limited assessment, shallow commonsense reasoning, models target joint, joint understanding
Comments: 49 pages, 23 figures, 10 tables; Project Page: [this https URL](https://grade-bench.github.io/), Code: [this https URL](https://github.com/VisionXLab/GRADE), Dataset: [this https URL](https://huggingface.co/datasets/VisionXLab/GRADE)
Abstract:Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
5. 【2603.12262】Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Link: https://arxiv.org/abs/2603.12262
Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Video Large Language, Large Language, Online Video Large, Language Models
Comments:
Abstract:Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at this https URL.
6. 【2603.12261】The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Link: https://arxiv.org/abs/2603.12261
Authors: Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: images remains difficult, generated images remains, achieving fine-grained control, generation models, advanced rapidly
Comments: Preprint
Abstract:Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at this https URL.
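For readers unfamiliar with the Hue, Saturation, and Lightness axes mentioned here, the sketch below shows them in ordinary pixel space using Python's standard `colorsys` module. It is not the paper's latent-space method: the abstract describes a closed-form manipulation inside the FLUX VAE latent space, whose details are not given. The helper names are assumptions.

```python
import colorsys


def rgb_to_hsl(r: int, g: int, b: int) -> tuple[float, float, float]:
    """Convert 8-bit RGB to (hue in degrees, saturation, lightness)."""
    # colorsys works on floats in [0, 1] and returns values in H, L, S order.
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    return h * 360, s, l


def shift_hue(rgb: tuple[int, int, int], degrees: float) -> tuple[int, ...]:
    """Rotate the hue of an 8-bit RGB color, keeping saturation and lightness."""
    h, l, s = colorsys.rgb_to_hls(*(c / 255 for c in rgb))
    h = (h + degrees / 360) % 1.0
    return tuple(round(c * 255) for c in colorsys.hls_to_rgb(h, l, s))
```

Rotating pure red by 120 degrees with `shift_hue((255, 0, 0), 120)` gives pure green, which is the kind of single-axis color edit that an HSL-like structure makes straightforward.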
7. 【2603.12257】DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
Link: https://arxiv.org/abs/2603.12257
Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multi-granularity motion remains, revolutionized video synthesis, significant challenge, remains a significant, identity
Comments: Project Page: [this https URL](https://dreamvideo-omni.github.io)
点击查看摘要
Abstract:While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
8. 【2603.12255】Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Link: https://arxiv.org/abs/2603.12255
Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: understand real-world spaces, Humans perceive, visual observations, perceive and understand, understand real-world
Comments: Project Page: [this https URL](https://liuff19.github.io/Spatial-TTT)
Abstract:Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: this https URL.
9. 【2603.12254】Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Link: https://arxiv.org/abs/2603.12254
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multi-modal large language, large language models, Multi-modal large, significant spatiotemporal redundancy, advanced general-purpose video
Comments: CVPR 2026. Project page: [this https URL](https://autogaze.github.io/)
Abstract:Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: this https URL.
10. 【2603.12252】EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Link: https://arxiv.org/abs/2603.12252
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
Comments: 23 pages, 18 figures
Abstract:Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
11. 【2603.12250】DVD: Deterministic Video Depth Estimation with Generative Priors
Link: https://arxiv.org/abs/2603.12250
Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: resolve semantic ambiguities, demand massive labeled, massive labeled datasets, Existing video depth, generative models suffer
Comments: Project: [this https URL](https://dvd-project.github.io/)
Abstract:Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
12. 【2603.12249】SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
Link: https://arxiv.org/abs/2603.12249
Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Constructing scientific multimodal, trade-off among scale, Constructing scientific, involves an inherent, inherent trade-off
Comments:
Abstract:Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
13. 【2603.12247】Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
Link: https://arxiv.org/abs/2603.12247
Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reinforcement learning, enhancing image editing, Image Reward Modeling, Faithful Image Reward, Faithful Image
Comments:
Abstract:Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at this https URL.
14. 【2603.12245】One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Link: https://arxiv.org/abs/2603.12245
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieve high generative, limiting principled latency-quality, principled latency-quality trade-offs, wasting resource allocation, high generative quality
Comments: Project page: [this https URL](https://snap-research.github.io/elit/)
Abstract:Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\%$ and $39.6\%$ in FID and FDD scores. Project page: this https URL
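The "random dropping of tail latents" mentioned in the ELIT abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `drop_tail_latents`, the `min_keep` parameter, and the uniform sampling of the prefix length are assumptions made here for illustration only.

```python
import random


def drop_tail_latents(latents, min_keep=1, rng=random):
    """Keep a random-length prefix of the latent sequence.

    Hypothetical helper: the abstract only says tail latents are dropped
    at random during training; uniform sampling is an assumption.
    """
    keep = rng.randint(min_keep, len(latents))  # inclusive on both ends
    return latents[:keep]


# Toy latent sequence; a real model would hold tensors here.
latents = [f"z{i}" for i in range(8)]
rng = random.Random(0)
kept = drop_tail_latents(latents, rng=rng)
print(len(kept), kept)
```

Because only the tail is ever removed, earlier latents are present in every training step, which is consistent with the abstract's claim that they end up carrying global structure. At inference, the same slicing with a fixed length would select a compute budget.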
15. 【2603.12240】BiGain: Unified Token Compression for Joint Generation and Classification
Link: https://arxiv.org/abs/2603.12240
Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: typically optimize synthesis, optimize synthesis quality, ignore discriminative capacity, diffusion models, typically optimize
Comments: CVPR 2026. Code: [this https URL](https://github.com/Greenoso/BiGain)
Abstract:Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
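One plausible reading of "a controllable interextrapolation between nearest and average pooling" is a single coefficient that blends the two pooled values. The sketch below is a 1-D illustration under that assumption; the formula `average + alpha * (nearest - average)`, the name `interp_extrap_downsample`, and the choice of the first window element as the "nearest" sample are not taken from the paper.

```python
def interp_extrap_downsample(values, stride, alpha):
    """Downsample a 1-D sequence by blending nearest and average pooling.

    alpha = 0 gives average pooling, alpha = 1 gives nearest (strided)
    sampling, and alpha > 1 extrapolates beyond the nearest sample.
    The blending formula is an assumption made for illustration.
    """
    out = []
    for start in range(0, len(values) - stride + 1, stride):
        window = values[start:start + stride]
        nearest = window[0]             # strided sample
        average = sum(window) / stride  # mean of the window
        out.append(average + alpha * (nearest - average))
    return out


keys = [1.0, 3.0, 5.0, 7.0]
print(interp_extrap_downsample(keys, stride=2, alpha=0.0))  # average pooling
print(interp_extrap_downsample(keys, stride=2, alpha=1.0))  # nearest pooling
print(interp_extrap_downsample(keys, stride=2, alpha=2.0))  # extrapolation
```

In the setting the abstract describes, this kind of operator would be applied to keys and values only, leaving queries at full resolution.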
16. 【2603.12238】SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Link: https://arxiv.org/abs/2603.12238
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital content creation, content creation, highly desirable, desirable for digital, digital content
Comments: Code: [this https URL](https://github.com/ROUJINN/SceneAssistant)
Abstract:Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at this https URL
17. 【2603.12222】HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers
Link: https://arxiv.org/abs/2603.12222
Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Vision Transformers require, Transformers require significant, Vision Transformers, require significant computational, significant computational resources
Comments: 14 pages, 9 figures, 3 Tables
Abstract:Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
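For readers unfamiliar with the stochastic Gumbel-Sigmoid gates named in the HiAP abstract, here is a minimal standard-library sketch of the general technique, not of HiAP itself. The function name, the temperature default, and the hard threshold at inference are assumptions for illustration.

```python
import math
import random


def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Sample a relaxed binary gate in [0, 1] from a learnable logit.

    The difference of two independent Gumbel(0, 1) samples follows a
    Logistic(0, 1) distribution, so the noise is drawn as
    log(u) - log(1 - u) with u uniform on (0, 1).
    """
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)  # avoid log(0)
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))


rng = random.Random(0)
# Hypothetical gate logits for three prunable units (e.g. attention heads).
logits = [-4.0, 0.0, 4.0]
soft_gates = [gumbel_sigmoid(l, temperature=0.5, rng=rng) for l in logits]
# Assumed inference rule: keep a unit when its logit is positive.
hard_gates = [1 if l > 0 else 0 for l in logits]
print(soft_gates, hard_gates)
```

During training, the soft gate would multiply the output of the unit it controls, so that gradients can flow to the logit; lower temperatures push samples toward 0 or 1.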
18. 【2603.12221】A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Link: https://arxiv.org/abs/2603.12221
Authors: Jiajun Sun, Zhe Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Affective Behavior Analysis, facial emotional expressions, Affective Behavior, Behavior Analysis, requires frame-level classification
Comments: 10 pages, 4 figures
Abstract:This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
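The Macro-F1 metric reported in this abstract is the unweighted mean of per-class F1 scores. The following sketch computes it on made-up labels; the convention of scoring a class with no true or predicted samples as 0 is an assumption, and the toy data below has no relation to the ABAW dataset.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    A class with no true and no predicted samples contributes 0 here;
    that convention is an assumption, as evaluation toolkits differ.
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy frame-level labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```

Because every class counts equally, a rare expression that is classified poorly lowers Macro-F1 as much as a frequent one, which is why it is commonly used for imbalanced frame-level tasks.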
19. 【2603.12217】Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Link: https://arxiv.org/abs/2603.12217
Authors: Görkay Aydemir, Fatma Güney, Weidi Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large synthetic datasets, long-term point tracking, synthetic datasets, long-term point, point tracking
Comments: CVPR 2026
Abstract:Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce a verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: this https URL
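The per-frame selection step described in this abstract can be pictured as an argmax over tracker candidates. The sketch below assumes the verifier has already produced one reliability score per tracker and frame; the function name, the data layout, and the numbers are hypothetical, and the learned verifier itself is not modeled.

```python
def select_pseudo_labels(candidates, scores):
    """Pick, for each frame, the candidate point with the highest score.

    candidates[k][t] is the (x, y) prediction of tracker k at frame t;
    scores[k][t] is a reliability score for that prediction, assumed to
    come from a verifier model that is not implemented here.
    """
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    labels = []
    for t in range(num_frames):
        best = max(range(num_trackers), key=lambda k: scores[k][t])
        labels.append(candidates[best][t])
    return labels


# Two hypothetical trackers following one point over three frames.
candidates = [
    [(10.0, 5.0), (11.0, 5.5), (30.0, 9.0)],  # tracker A drifts at frame 2
    [(10.2, 5.1), (11.4, 5.6), (12.1, 6.0)],  # tracker B stays on target
]
scores = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.9],
]
pseudo_labels = select_pseudo_labels(candidates, scores)
print(pseudo_labels)
```

The resulting trajectory mixes trackers frame by frame, which matches the abstract's point that teacher reliability varies across frames and scenes.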
20. 【2603.12215】RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
Link: https://arxiv.org/abs/2603.12215
Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Yaoqi Sun, Sam Kwong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: remote sensing images, sensing images faces, images faces significant, faces significant challenges, significant challenges due
Comments:
Abstract:Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
21. 【2603.12208】ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Link: https://arxiv.org/abs/2603.12208
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, enable interpretable multimedia, Large Language
Comments:
Abstract:Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
22. 【2603.12193】SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
Link: https://arxiv.org/abs/2603.12193
Authors: Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex scenes, crucial for robots, robots to interact, interact with complex, Active perception
Comments: Accepted to CVPR 2026. See project page at [this https URL](https://lmzpai.github.io/SaPaVe)
Abstract:Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and $\pi_0$, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: this https URL
23. 【2603.12176】BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Link: https://arxiv.org/abs/2603.12176
Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: freely moving animal, linking neural activity, Understanding freely moving, moving animal behavior, behavioral understanding form
Comments:
Abstract:Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
24. 【2603.12166】LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
Link: https://arxiv.org/abs/2603.12166
Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, multimodal large language, language models, multimodal large, recent advances
Comments:
Abstract:Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
25. 【2603.12155】GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Link: https://arxiv.org/abs/2603.12155
Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: driving significant progress, accurately generating complex, generating complex text, mathematical formulas remains, generative models driving
Comments:
Abstract:Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at this https URL.
26. 【2603.12149】Linking Perception, Confidence and Accuracy in MLLMs
Link: https://arxiv.org/abs/2603.12149
Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, Recent advances
Comments: Accepted by CVPR2026
Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
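To make "confidence miscalibration" concrete, the sketch below computes the expected calibration error (ECE), a common calibration measure. Using ECE here is an assumption: the abstract does not say which metric the authors use, and the function name, bin count, and sample values are hypothetical.

```python
# Illustrative only: expected calibration error (ECE) with equal-width bins.
# The abstract does not state the paper's metric; ECE is an assumed stand-in.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# An overconfident model: 95% stated confidence, but only half the answers are right.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# A calibrated model: 50% stated confidence, half the answers are right.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A large value for the first case and a value near zero for the second is the pattern a calibration-oriented training signal would aim to move toward.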
27. 【2603.12147】EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
Link: https://arxiv.org/abs/2603.12147
Authors: Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated remarkable video
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
28. 【2603.12146】FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Link: https://arxiv.org/abs/2603.12146
Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: achieved remarkable progress, Recent advances, trajectory-controllable video generation, video, video generation
Comments: Accepted by CVPR2026
Abstract:Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
29. 【2603.12144】O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Link: https://arxiv.org/abs/2603.12144
Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: Understanding and reconstructing, Omnidirectional Open-vocabulary Occupancy, Open-vocabulary Occupancy predictioN, inevitable trend, development of autonomous
Comments: The source code will be made publicly available at [this https URL](https://github.com/MengfeiD/O3N)
Abstract:Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at this https URL.
30. 【2603.12138】HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Link: https://arxiv.org/abs/2603.12138
Authors: Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Graphical user interface, large vision-language models, shown remarkable potential, Graphical user, effective agent training
Comments: Accepted by CVPR 2026
Abstract:Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
31. 【2603.12126】Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
Link: https://arxiv.org/abs/2603.12126
Authors: Agniv Sharma, Xianghui Xie, Tom Fischer, Eddy Ilg, Gerard Pons-Moll
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Modeling and generating, crucial for applications, high-quality interaction data, Modeling, interaction
Comments:
Abstract:Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.
32. 【2603.12120】CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
Link: https://arxiv.org/abs/2603.12120
Authors: Leo Lin, Shivansh Patel, Jay Moon, Svetlana Lazebnik, Unnat Jain
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: hybrid hard-soft compliance, tendon-driven anthropomorphic hand, introduce CRAFT hand, contact-rich manipulation, tendon-driven anthropomorphic
Comments:
Abstract:We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rolling-contact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the forearm drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with vision-based teleoperation and simulation integration. Project page: this http URL
33. 【2603.12108】EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation
Link: https://arxiv.org/abs/2603.12108
Authors: Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demands fine-grained pixel-level, fine-grained pixel-level representations, generation demands fine-grained, fundamentally challenged, granularity gap
Comments:
Abstract:The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
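For readers unfamiliar with residual vector quantization, the generic mechanism can be sketched as follows. This is not EvoTok's implementation and says nothing about how EvoTok assigns pixel-level or semantic roles to stages; the two tiny codebooks, the input vector, and all function names are hypothetical.

```python
# Illustrative only: generic residual vector quantization (RVQ), not EvoTok's code.
# The codebooks are tiny hand-written examples with hypothetical values.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the last."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],     # stage 1: first approximation
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],   # stage 2: corrects the stage-1 residual
]

tokens, residual = rvq_encode([1.3, 0.9], CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
```

Each input is thus represented by one token per stage, which is the "cascaded sequence of residual tokens" the abstract refers to.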
34. 【2603.12083】Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis
Link: https://arxiv.org/abs/2603.12083
Authors: Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Zhonghua Yi, Kailun Yang, Luc Van Gool, Kaiwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Optics (physics.optics)
Keywords: Prevalent Computational Aberration, Computational Aberration Correction, Prevalent Computational, Aberration Correction, Computational Aberration
Comments: Accepted to CVPR 2026. Benchmarks, codes, and Zemax files will be available at [this https URL](https://github.com/XiaolongQian/UniCAC)
Abstract:Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at this https URL.
35. 【2603.12078】Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Link: https://arxiv.org/abs/2603.12078
Authors: Hiran Sarkar, Liming Kuang, Yordanka Velikova, Benjamin Busam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Predicting scene dynamics, Predicting scene, observations is challenging, Neural Radiance Fields, Ordinary Differential Equations
Comments: Accepted to CVPR 2026. 13 pages, 9 figures
Abstract:Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries, failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without an explicit model to identify critical points for future predictions.
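The idea of a state that "evolves over time via an ODE solver" can be sketched with a fixed-step Runge-Kutta integrator. This is not Node-RF's code: the abstract does not name the solver, so RK4 is an assumption, and a hand-written harmonic oscillator stands in for the learned dynamics network. All names are hypothetical.

```python
import math

# Illustrative only: fixed-step RK4 integration of a state vector.
# `oscillator` is a hand-written stand-in for a learned dynamics function.

def rk4_step(f, state, t, dt):
    """Advance `state` by one classical Runge-Kutta (RK4) step of size dt."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = f(t, state)
    k2 = f(t + dt / 2, shifted(state, k1, dt / 2))
    k3 = f(t + dt / 2, shifted(state, k2, dt / 2))
    k4 = f(t + dt, shifted(state, k3, dt))
    return [
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]

def integrate(f, state, t0, t1, steps):
    """Integrate from t0 to t1; any query time can be reached from the same state."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        state = rk4_step(f, state, t, dt)
        t += dt
    return state

def oscillator(t, state):
    """d(pos)/dt = vel, d(vel)/dt = -pos (unit-frequency harmonic oscillator)."""
    pos, vel = state
    return [vel, -pos]

# Start at pos=1, vel=0 and integrate for half a period.
state_at_pi = integrate(oscillator, [1.0, 0.0], 0.0, math.pi, 200)
```

Because the state is defined at every time rather than only at sampled frames, the same integrator can be run past the last observed time, which is the sense in which such a representation supports extrapolation.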
36. 【2603.12071】LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
Link: https://arxiv.org/abs/2603.12071
Authors: Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Alzheimer disease assessment, Alzheimer disease, Mild Cognitive Impairment, Longitudinal brain MRI, neurological diseases
Comments:
Abstract:Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA) and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at this https URL.
37. 【2603.12067】Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing
Link: https://arxiv.org/abs/2603.12067
Authors: Simone Cammarasana
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: fundamental building block, modern convolutional neural, convolutional neural networks, efficient implementation, fundamental building
Comments:
Abstract:The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.
38. 【2603.12064】Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos
Link: https://arxiv.org/abs/2603.12064
Authors: Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multiple freely moving, freely moving cameras, multiple observers capture, dense dynamic scene, dynamic scene reconstruction
Comments:
Abstract:We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
39. 【2603.12063】NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction
Link: https://arxiv.org/abs/2603.12063
Authors: David Svitov, Mahtab Dahaghin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: handling non-rigid deformations, head avatars handling, avatars handling non-rigid, non-rigid deformations caused, handling non-rigid
Comments:
Abstract:We present NBAvatar, a method for realistic rendering of head avatars that handles non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.
40. 【2603.12057】Coarse-Guided Visual Generation via Weighted h-Transform Sampling
Link: https://arxiv.org/abs/2603.12057
Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: low-fidelity coarse references, Coarse-guided visual generation, Coarse-guided visual, synthesizes fine visual, coarse references
Comments:
Abstract:Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guidance method based on the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the drift term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
41. 【2603.12055】Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
Link: https://arxiv.org/abs/2603.12055
Authors: Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: pretrained vision-language models, induce geometric distortion, current approaches adapt, allowing new-task supervision, Continual learning
Comments: 14 pages, 11 figures, under review
Abstract:Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
42. 【2603.12036】Single Pixel Image Classification using an Ultrafast Digital Light Projector
链接:https://arxiv.org/abs/2603.12036
作者:Aisha Kanwal,Graeme E. Johnstone,Fahimeh Dehkhoda,Johannes H. Herrnsdorf,Robert K. Henderson,Martin D. Dawson,Xavier Porte,Michael J. Strain
类目:Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
关键词:image classification, Pattern recognition, image, machine vision, classification
备注:
点击查看摘要
Abstract:Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as a binary classifier we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.
43. 【2603.12016】Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era
链接:https://arxiv.org/abs/2603.12016
作者:Nicholas Schaub,Andriy Kharchenko,Hamdah Abbasi,Sameeul Samee,Hythem Sidky,Nathan Hotaling
类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:Modern imaging instruments, Modern imaging, single experiment, imaging instruments, instruments can produce
备注: 29 pages, 9 figures, 6 supplemental tables
点击查看摘要
Abstract:Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, as a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many feature sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.
44. 【2603.12013】Pano360: Perspective to Panoramic Vision with Geometric Consistency
链接:https://arxiv.org/abs/2603.12013
作者:Zhengdong Zhu,Weiyi Xue,Zuyuan Yang,Wenlve Zhou,Zhiheng Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Prior panorama stitching, panorama stitching approaches, stitching approaches heavily, approaches heavily rely, leverage geometric consistency
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.
45. 【2603.12008】CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation
链接:https://arxiv.org/abs/2603.12008
作者:Ziqi Ye,Ziyang Gong,Ning Liao,Xiaoxing Hu,Di Wang,Hongruixuan Chen,Chen Huang,Yiguo He,Yuru Jia,Xiaoxing Wang,Haipeng Wang,Xue Yang,Junchi Yan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, all-weather earth observation, enables global
备注: 26 pages, 15 figures
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmark and datasets will be publicly available.
46. 【2603.11984】Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation
链接:https://arxiv.org/abs/2603.11984
作者:Chongyang Xu,Yixian Zou,Ziliang Feng,Fanman Meng,Shuaicheng Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visuomotor policies effectively, Diffusion-based visuomotor policies, latency limits real-time, high inference latency, inference latency limits
备注:
点击查看摘要
Abstract:Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.
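The "sigmoid-scheduled loss transition" mentioned in the abstract can be pictured with a minimal sketch. The abstract does not give the schedule's exact form, so everything below is an assumption made for illustration: the function names, the `midpoint` and `sharpness` parameters, and the convex blend of a coarse loss with a refinement loss are hypothetical and are not taken from the paper.

```python
import math


def sigmoid_weight(step, total_steps, midpoint=0.5, sharpness=10.0):
    """Blend weight in (0, 1) that rises smoothly with training progress.

    midpoint and sharpness are hypothetical knobs, not values from the paper.
    """
    progress = step / max(total_steps, 1)
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))


def scheduled_loss(coarse_loss, refine_loss, step, total_steps):
    """Shift emphasis from the coarse loss to the refinement loss over training."""
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * refine_loss
```

Under these assumptions, early steps are dominated by the coarse term, late steps by the refinement term, and the two are weighted equally at the midpoint of training.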
47. 【2603.11975】HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
链接:https://arxiv.org/abs/2603.11975
作者:Jiayue Pu,Zhongxiang Sun,Zilu Zhang,Xiao Zhang,Jun Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:real-world environments, rapid evolution, evolution of embodied, embodied agents, agents has accelerated
备注:
点击查看摘要
Abstract:The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
48. 【2603.11971】Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
链接:https://arxiv.org/abs/2603.11971
作者:Junhyeong Byeon,Jeongyeol Kim,Sejoon Lim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:challenging problem due, inherently dynamic nature, video data remains, Affective Behavior Analysis, background noise
备注: 7 pages
点击查看摘要
Abstract:Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
49. 【2603.11969】AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies
链接:https://arxiv.org/abs/2603.11969
作者:Jennifer Nolan,Travis Driver,John Christian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:small celestial bodies, Image-based surface reconstruction, informs mission planning, celestial bodies, scientific analysis
备注: 10 pages, 6 figures, conference
点击查看摘要
Abstract:Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as it informs mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.
50. 【2603.11952】Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments
链接:https://arxiv.org/abs/2603.11952
作者:Pankaj Deoli,Karthik Ranganath,Karsten Berns
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:RGB-NIR image registration, RGB-NIR image, role in sensor-fusion, image registration plays, plays an important
备注: Preliminary results
点击查看摘要
Abstract:RGB-NIR image registration plays an important role in sensor-fusion, image enhancement and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under 6 different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Even though this is just a preliminary evaluation, our study indicates that further refinements are needed for robust, multi-scale registration in off-road forest applications.
51. 【2603.11938】Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting
链接:https://arxiv.org/abs/2603.11938
作者:Chantal Pellegrini,Adrian Delchev,Ege Özsoy,Nassir Navab,Matthias Keicher
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:automation remains difficult, limited structured supervision, reporting promises faster, radiology reporting promises, promises faster
备注:
点击查看摘要
Abstract:Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant for the current image-question pair and augment the model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text derived signal for fine-grained image understanding.
52. 【2603.11917】PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
链接:https://arxiv.org/abs/2603.11917
作者:Pietro Bonazzi,Nicola Farronato,Stefan Zihlmann,Haotong Qin,Michele Magno
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Efficient Channel Attention, critical for latency-sensitive, latency-sensitive and privacy-aware, privacy-aware applications, smart glasses
备注:
点击查看摘要
Abstract:Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
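For readers unfamiliar with the mIoU figures quoted above, here is a minimal sketch of the metric. It assumes mIoU means the average of per-mask intersection-over-union scores over (prediction, ground truth) pairs; the abstract does not spell out the exact averaging protocol, and the function names and flat 0/1 mask representation are hypothetical choices for illustration.

```python
def iou(pred, target):
    """Intersection over union of two equal-length binary masks (flat 0/1 lists)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a non-empty list of (pred, target) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)
```

A prediction that covers exactly the ground-truth region scores 1.0, while a prediction with no overlap scores 0.0, so the mean over a dataset summarizes how well masks align on average.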
53. 【2603.11911】InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
链接:https://arxiv.org/abs/2603.11911
作者:InSpatio Team:Xiaoyu Zhang,Weihong Pan,Zhichao Ye,Jialin Liu,Yipeng Chen,Nan Wang,Xiaojun Xiang,Weijian Xie,Yifu Wang,Haoyu Ji,Siji Pan,Zhewen Le,Jing Guo,Xianbin Liu,Donghui Shen,Ziqiang Zhao,Haomin Liu,Guofeng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:open-source real-time frame, spatial intelligence, video-based world models, present InSpatio-WorldFM, frame
备注: Project page: [this https URL](https://inspatio.github.io/worldfm/) Code: [this https URL](https://github.com/inspatio/worldfm)
点击查看摘要
Abstract:We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
54. 【2603.11896】Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
链接:https://arxiv.org/abs/2603.11896
作者:Lu Wang(1),Zhuoran Jin(1),Yupu Hao(1),Yubo Chen(1),Kang Liu(1),Yulong Ao(2),Jun Zhao(1) ((1) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, (2) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China)
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal large language, large language models, offline video understanding, Multimodal large, continuously arriving video
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: this https URL
55. 【2603.11888】Single-View Rolling-Shutter SfM
链接:https://arxiv.org/abs/2603.11888
作者:Sofía Errázuriz Muñoz,Kim Kiehn,Petr Hruby,Kathlén Kohn
类目:Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
关键词:cameras are ubiquitous, fully solved, Rolling-shutter, Abstract, observed world points
备注:
点击查看摘要
Abstract:Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.
56. 【2603.11866】Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration
链接:https://arxiv.org/abs/2603.11866
作者:Zhaocheng Yu,Xiang Chen,Runzhe Li,Zihan Geng,Guanglu Sun,Haipeng Li,Kui Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:static inference paradigm, existing models suffer, advanced single-image deraining, coupled degradations, fundamental limitation
备注:
点击查看摘要
Abstract:While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.
57. 【2603.11846】ZeroSense: How Vision matters in Long Context Compression
链接:https://arxiv.org/abs/2603.11846
作者:Yonghan Gao,Zehong Chen,Lijian Xu,Jingzhi Chen,Jingwei Guan,Xingyu Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent visual-text compression, report impressive high, token compression ratios, high token compression, impressive high token
备注:
点击查看摘要
Abstract:Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.
58. 【2603.11836】A Decade of Generative Adversarial Networks for Porous Material Reconstruction
链接:https://arxiv.org/abs/2603.11836
作者:Ali Sadeghkhani,Brandon Bennett,Masoud Babaei,Arash Rabbani
类目:Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Geophysics (physics.geo-ph)
关键词:electrochemical device design, geological reservoir characterization, Generative Adversarial Networks, Digital reconstruction, device design
备注: 96 pages, supplementary material included (34 pages, 6 tables covering all 96 reviewed implementations)
点击查看摘要
Abstract:Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.
59. 【2603.11831】Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding
链接:https://arxiv.org/abs/2603.11831
作者:Jiahao Li,Qingwang Zhang,Qiuyu Chen,Guozhan Qiu,Yunzhong Lou,Xiangdong Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made significant progress, CAD, recent years, field of Computer-Aided, made significant
备注: preprint
点击查看摘要
Abstract:The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.
60. 【2603.11827】Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning
链接:https://arxiv.org/abs/2603.11827
作者:Robin Peretzke,Marlin Hanstein,Maximilian Fischer,Lars Badhi Wessel,Obada Alhalabi,Sebastian Regnery,Andreas Kudak,Maximilian Deng,Tanja Eichkorn,Philipp Hoegen Saßmannshausen,Fabian Allmendinger,Jan-Hendrik Bolten,Philipp Schröter,Christine Jungk,Jürgen Peter Debus,Peter Neher,Laila König,Klaus Maier-Hein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:radiation-induced contrast enhancements, post-treatment glioblastoma patients, glioblastoma patients remains, major clinical challenge, recurrence and radiation-induced
备注:
点击查看摘要
Abstract:The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.
61. 【2603.11818】Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI
Link: https://arxiv.org/abs/2603.11818
Authors: Md. Hasin Sarwar Ifty, Nisharga Nirjan, Labib Islam, M. A. Diganta, Reeyad Ahmed Ornate, Anika Tasnim, Md. Saiful Islam
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: unrestrained proliferation, proliferation of cells, malignant in nature, ovarian cancer, Convolutional Neural Networks
Comments: Accepted and published at ICAIC 2025. Accepted version
Abstract:The unrestrained proliferation of cells that are malignant in nature is cancer. In recent times, medical professionals have been steadily acquiring enhanced diagnostic and treatment abilities by implementing deep learning models to analyze medical data for better clinical decision-making, disease diagnosis and drug discovery. A majority of cancers are studied and treated by incorporating these technologies. However, ovarian cancer remains a dilemma, as its non-invasive detection procedures are inaccurate and its accurate detection procedure is time-consuming and invasive. Thus, in this research, several Convolutional Neural Networks such as LeNet-5, ResNet, VGGNet and GoogLeNet/Inception have been utilized to develop 15 variants and choose a model that accurately detects and identifies ovarian cancer. For effective model training, the dataset OvarianCancerSubtypesDatasetHistopathology from Mendeley has been used. After constructing a model, we utilized Explainable Artificial Intelligence (XAI) models such as LIME, Integrated Gradients and SHAP to explain the black box outcome of the selected model. For evaluating the performance of the model, Accuracy, Precision, Recall, F1-Score, ROC Curve and AUC have been used. From the evaluation, it was seen that the slightly compact InceptionV3 model with ReLU had the overall best result, achieving an average score of 94% across all the performance metrics in the augmented dataset. Lastly for XAI, the three aforementioned XAI methods have been used for an overall comparative analysis. It is the aim of this research that the contributions of the study will help in achieving a better detection method for ovarian cancer.
62. 【2603.11811】RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset
Link: https://arxiv.org/abs/2603.11811
Authors: Yongzhong Wang, Keyu Zhu, Yong Zhong, Liqiong Wang, Jinyu Yang, Feng Zheng
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern robot learning, large-scale physical interaction, critical prerequisite, prerequisite for modern, modern robot
Comments: 8 pages, 4 figures. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract:The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR's exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
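The Last-In, First-Out reset order mentioned in the abstract can be illustrated with a toy sketch: actions are undone in the reverse of the order in which they were executed, so that later moves never block earlier ones. The `Action` type, its fields, and `plan_reset` are hypothetical names chosen for illustration, and this is not RADAR's planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical record of one executed pick-and-place step."""
    skill: str
    obj: str
    src: str
    dst: str

    def inverse(self) -> "Action":
        # The inverse of moving obj from src to dst is moving it back.
        return Action(self.skill, self.obj, self.dst, self.src)


def plan_reset(executed: list[Action]) -> list[Action]:
    """Undo executed actions in last-in, first-out order."""
    return [action.inverse() for action in reversed(executed)]


forward = [
    Action("move", "cup", "table", "shelf"),
    Action("move", "plate", "table", "rack"),
]
reset = plan_reset(forward)
# The plate (moved last) is returned first, then the cup.
```

A stack-like ordering is the simplest way to guarantee that each undo step starts from the state its forward step produced.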
63. 【2603.11810】CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing
Link: https://arxiv.org/abs/2603.11810
Authors: Yue Shi, Rui Shi, Yuxuan Xiong, Bingbing Ni, Wenjun Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deeply integrated nature, unrefined results due, handler points, produce unrealistic, unrealistic and unrefined
Comments:
Abstract:Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available on this https URL.
64. 【2603.11804】OSM-based Domain Adaptation for Remote Sensing VLMs
Link: https://arxiv.org/abs/2603.11804
Authors: Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel (INSAIT, Sofia University "St. Kliment Ohridski")
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: domain-specific image-text supervision, imagery remain scarce, sensing rely heavily, image-text supervision, expensive to produce
Comments:
Abstract:Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
65. 【2603.11795】Intrinsic Concept Extraction Based on Compositional Interpretability
Link: https://arxiv.org/abs/2603.11795
Authors: Hanyu Shi, Hong Tao, Guoheng Huang, Jianbin Jiang, Xuhang Chen, Chi-Man Pun, Shanhu Wang, Pan Pan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Unsupervised Concept Extraction, Intrinsic Concept Extraction, Concept Extraction aims, Concept Extraction, existing methods suffer
Comments: Accepted by CVPR 2026
Abstract:Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
66. 【2603.11793】Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder
Link: https://arxiv.org/abs/2603.11793
Authors: Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Standard fairness audits, Concept Activation Vectors, Standard fairness, foundation models quantify, zero-shot Concept Activation
Comments: 14 pages, 6 tables, 2 figures. Work conducted during IPCV-AI Erasmus Mundus Master
Abstract:Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramér's V: from 0.381 to 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer accounts for the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness.
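The bias score quoted in this abstract, Cramér's V, measures association between two categorical variables on a scale from 0 (independent) to 1 (perfectly associated). As a minimal sketch, it can be computed from a contingency table via the chi-squared statistic: V = sqrt(chi2 / (n * (min(r, c) - 1))). The table layout used here (rows as predicted classes, columns as demographic groups) is an assumption for illustration and not necessarily the paper's exact protocol.

```python
import math


def cramers_v(table: list[list[float]]) -> float:
    """Cramér's V for an r x c contingency table of counts.

    Assumes every row and column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic against the independence model.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: predictions split identically across groups.
independent = cramers_v([[10, 20], [20, 40]])
# Hypothetical counts: each class predicted for one group only.
fully_associated = cramers_v([[30, 0], [0, 30]])
```

Under this reading, a drop in V after ablating attention heads means the predictions have become less dependent on the demographic attribute.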
67. 【2603.11783】HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification
Link: https://arxiv.org/abs/2603.11783
Authors: Marjan Stoimchev, Boshko Koloski, Jurica Levatić, Dragi Kocev, Sašo Džeroski
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Hierarchical multi-label classification, complex label dependencies, modeling complex label, multi-label classification, Explicit Label Modeling
Comments: Accepted and presented at REO workshop at EurIPS 2025
Abstract:Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
68. 【2603.11755】Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
Link: https://arxiv.org/abs/2603.11755
Authors: Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Motion-controllable video generation, Motion-controllable video, generation is crucial, applications in virtual, virtual reality
Comments:
Abstract:Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
69. 【2603.11746】SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
Link: https://arxiv.org/abs/2603.11746
Authors: Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diffusion models offer, combining diffusion modeling, sequential generation tasks, models offer, offer a promising
Comments:
Abstract:Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
70. 【2603.11734】VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On
Link: https://arxiv.org/abs/2603.11734
Authors: Xiaoye Liang, Zhiyuan Qu, Mingye Zou, Jiaxin Liu, Lai Jiang, Mai Xu, Yiheng Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: VTON, continues to advance, existing specialized VTON, virtual try-on, growing number
Comments:
Abstract:As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.
71. 【2603.11725】Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction
Link: https://arxiv.org/abs/2603.11725
Authors: Ammar Kheder, Helmi Toropainen, Wenqing Peng, Samuel Antão, Zhi-Song Liu, Michael Boy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: continent-scale domains required, real-world environmental monitoring, achieved remarkable success, scalability remains limited, dual-branch Vision Transformer
Comments:
Abstract:Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.
72. 【2603.11717】COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection
Link: https://arxiv.org/abs/2603.11717
Authors: Guillem González, Guillem Alenyà, Sergi Foix
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fibre degradation, critical phase, physically manipulated, lead to fibre, cotton capsules
Comments: 15 pages, 11 figures. This paper will be submitted to Computers and Electronics in Agriculture, special issue
Abstract:Cotton harvesting is a critical phase in which cotton capsules are physically manipulated, which can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the replacement of standard upsampling operations with Content-Aware ReAssembly of FEatures (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models, utilizing 7.6M parameters and 27.8 GFLOPs, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.
73. 【2603.11698】OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Link: https://arxiv.org/abs/2603.11698
Authors: Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: made rapid progress, producing visually high-quality, made rapid, rapid progress, progress in producing
Comments: Project page: https://hanxjing.github.io/OSCBench
Abstract:Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
74. 【2603.11695】PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures
Link: https://arxiv.org/abs/2603.11695
Authors: Chi Chen, Tianle Jiang, Xiaodong Wei, Yanming Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Keywords: exert a critical, critical influence, polycrystalline materials exert, polycrystalline materials, polycrystalline
Comments:
Abstract:The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on the control of grain attributes (e.g., size and sphericity), thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to be a key step toward accelerated, data-driven optimization and design of polycrystalline materials.
75. 【2603.11680】UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
Link: https://arxiv.org/abs/2603.11680
Authors: Cao Thien Tan, Phan Thi Thu Trang, Do Nghiem Duc, Ho Ngoc Anh, Hanyang Zhuang, Nguyen Duc Dung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Hybrid CNN-Transformer architectures, increases computational cost, Hybrid CNN-Transformer, scaling attention windows, computational cost
Comments: Accepted to CVPR 2026
Abstract:Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
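The PSNR figures quoted in this abstract follow the standard definition PSNR = 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and restored images. A minimal sketch over flat pixel lists is given below. The function name and the 8-bit `max_value` default are assumptions for illustration, and published super-resolution numbers also depend on evaluation details (such as colour channel and border handling) that the abstract does not specify.

```python
import math


def psnr(reference: list[float], restored: list[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images: every pixel is off by 1 grey level.
example_db = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a modest but measurable reduction in mean squared error.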
76. 【2603.11675】PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
Link: https://arxiv.org/abs/2603.11675
Authors: Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan, Dejia Song, Yibo Chen, Xu Tang, Yao Hu, Lu Sheng, Zhiyong Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reliable fit guidance, provide reliable fit, results provide reliable, online retail, fit guidance
Comments: CVPR 2026
Abstract:Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.
77. 【2603.11664】BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder
Link: https://arxiv.org/abs/2603.11664
Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: learn strong visual, strong visual representations, Self-supervised and multimodal, encoders learn strong, downstream vision tasks
Comments: 17 pages, 10 figures, 6 tables
Abstract:Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time method for detecting backdoor samples in pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly as masking progresses. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
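The decision rule described in this abstract (cluster the embedding sequence, flag the input if more than one cluster appears) can be illustrated with a minimal sketch. This is not the authors' code: the encoder and the masking schedule are omitted, the embeddings are made-up 2-D points, and `eps`, `min_samples`, and the function names are assumptions chosen for the toy data. A small pure-Python DBSCAN is included so the example runs without extra dependencies.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

def is_flagged(embedding_sequence, eps=0.5, min_samples=2):
    """Flag the input if its embeddings form more than one cluster."""
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1

# Hypothetical embeddings along the masking trajectory (one per masking ratio).
clean = [(0.1 * k, 0.0) for k in range(8)]              # smooth drift
backdoored = [(0.1 * k, 0.0) for k in range(4)] + \
             [(5.0 + 0.1 * k, 0.0) for k in range(4)]   # abrupt jump

print(is_flagged(clean), is_flagged(backdoored))
```

In this toy setting the smooth sequence stays in one cluster while the sequence with an abrupt jump splits into two, which is the signal the abstract describes.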
78. 【2603.11659】FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation
Link: https://arxiv.org/abs/2603.11659
Authors: Meilu Zhu, Zhiwei Wang, Axiu Mao, Yuxing Li, Xiaohan Xing, Yixuan Yuan, Edmund Y. Lam
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sharing raw data, collaborative medical image, medical image segmentation, medical image, offers a privacy-preserving
Comments: 19 pages, 4 figures
Abstract:Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (e.g., FedBN), consistently outperform generic approaches; (ii) no single method universally dominates, with performance being dataset-dependent; (iii) communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) a method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at this https URL.
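To make "client-specific batch normalization" concrete, here is a minimal sketch of a FedBN-style aggregation round: shared parameters are averaged across clients, while each client keeps its own normalization parameters. It is an illustration only, not the benchmark's toolkit. Parameters are plain floats, the average is unweighted (real FedAvg-style methods typically weight by client sample count), and the `bn` name prefix used to identify normalization parameters is an assumption.

```python
def aggregate_keep_local_bn(client_states, is_bn=lambda name: name.startswith("bn")):
    """Average non-BN parameters across clients; leave BN parameters local.

    client_states: list of dicts mapping parameter name -> float (toy stand-in
    for tensors). Returns a new list of per-client dicts.
    """
    shared_names = [name for name in client_states[0] if not is_bn(name)]
    averaged = {
        name: sum(state[name] for state in client_states) / len(client_states)
        for name in shared_names
    }
    # Each client receives the averaged shared weights but keeps its own BN values.
    return [{**state, **averaged} for state in client_states]

clients = [
    {"conv.w": 1.0, "bn.gamma": 0.5},
    {"conv.w": 3.0, "bn.gamma": 2.0},
]
updated = aggregate_keep_local_bn(clients)
print(updated)
```

After the round, both clients share the same `conv.w` but retain different `bn.gamma` values, which is what lets each client adapt to its own feature statistics.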
79. 【2603.11647】OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
Link: https://arxiv.org/abs/2603.11647
Authors: Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Keywords: hindering real-time applications, Recent joint audio-visual, bidirectional attention dependencies, high latency due, audio-visual diffusion models
Comments: 14 pages
Abstract:Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at $\sim$25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project page: this https URL
80. 【2603.11644】IDRL: An Individual-Aware Multimodal Depression-Related Representation Learning Framework for Depression Diagnosis
Link: https://arxiv.org/abs/2603.11644
Authors: Chongxiao Wang, Junjie Liang, Peng Cao, Jinzhu Yang, Osmar R. Zaiane
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: severe mental disorder, reliable identification plays, mental disorder, intervention and treatment, severe mental
Comments:
Abstract:Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.
81. 【2603.11640】Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
Link: https://arxiv.org/abs/2603.11640
Authors: Sizhong Qin, Ramon Elias Weber, Xinzheng Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: design demands joint, Architectural floor plan, plan design demands, demands joint reasoning, floor plan design
Comments: 20 pages, 9 figures. Accepted to CVPR 2026
Abstract:Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.
82. 【2603.11633】MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
Link: https://arxiv.org/abs/2603.11633
Authors: Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made remarkable progress, Recent unified, producing high-quality, single image, models have made
Comments:
Abstract:Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at this https URL.
83. 【2603.11631】VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
Link: https://arxiv.org/abs/2603.11631
Authors: Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large vision-language models, Large vision-language, detect visual primitives, reliably detect visual, struggle to reliably
Comments: 30 pages, 21 figures, EACL 2026 Findings
Abstract:Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
84. 【2603.11627】Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography
Link: https://arxiv.org/abs/2603.11627
Authors: Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Feiyang Xiao, Yuchen Liu, Xiaohui Zhang, Hongwei Zhang, Shuqi Wang, Gang Feng, Liling Peng, Xin Gao, Yuanfan Xu, Yuan Qi, Kuangyu Shi, Hong Zhang, Yuan Cheng, Mei Tian, Zixin Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Positron emission tomography, key nuclear medicine, visualizes radiotracer distributions, Positron emission, medicine imaging modality
Comments:
Abstract:Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.
85. 【2603.11625】MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
Link: https://arxiv.org/abs/2603.11625
Authors: Shengyuan Liu, Zanting Ye, Yunrui Lin, Chen Hu, Wanting Geng, Xu Han, Bulat Ibragimov, Yefeng Zheng, Yixuan Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: volumetric data remains, achieved remarkable success, data remains constrained, significant computational inefficiencies, specialized Medical Vision-Language
Comments: 10 pages
Abstract:While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
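The idea of adaptive token selection by cumulative attention weight can be sketched in the style of nucleus (top-p) selection: keep the smallest set of tokens whose attention mass reaches a target fraction of the total. This is a hedged illustration rather than MedPruner's actual procedure. The function name, the 0.85 mass threshold, and the toy integer scores are assumptions.

```python
def nucleus_select(attn, mass=0.85):
    """Return indices of the smallest token set covering `mass` of total attention.

    attn: per-token attention scores (any non-negative numbers).
    Tokens are taken in descending score order until the running sum
    reaches mass * total; the kept indices are returned in original order.
    """
    total = sum(attn)
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += attn[i]
        if running >= mass * total:
            break
    return sorted(kept)

peaked = [70, 20, 5, 3, 2]   # a slice where two tokens dominate
flat = [20, 20, 20, 20, 20]  # a slice with evenly spread attention

print(nucleus_select(peaked))  # few tokens kept
print(nucleus_select(flat))    # all tokens kept
```

Because the number of kept tokens depends on how concentrated the scores are, the compression ratio adapts per slice instead of following a fixed pruning ratio, which is the limitation the abstract points to.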
86. 【2603.11618】Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
Link: https://arxiv.org/abs/2603.11618
Authors: Jiin Im, Sisung Liu, Je Hyeong Hong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: images lacking explicit, handling diverse, images lacking, essential for handling, lacking explicit correspondence
Comments: Accepted at CVPR 2026. Supplementary material included after references. 18 pages, 11 figures, 10 tables
Abstract:Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the aforementioned ambiguities. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
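For reference, the standard Fused Gromov-Wasserstein objective in its common textbook form is shown below. The symbols are assumptions introduced here, not notation from the paper: $M_{ij}$ is a feature cost between point $i$ in the source and point $j$ in the target, $C^{s}$ and $C^{t}$ are intra-image structure matrices (pairwise distances within each image), $p$ and $q$ are the marginal weights, and $\alpha \in [0,1]$ trades off the two terms.

$$
\min_{T \in \Pi(p,q)} \; (1-\alpha) \sum_{i,j} M_{ij}\, T_{ij} \;+\; \alpha \sum_{i,j,k,l} \left| C^{s}_{ik} - C^{t}_{jl} \right|^{2} T_{ij}\, T_{kl},
\qquad
\Pi(p,q) = \left\{ T \ge 0 : T\mathbf{1} = p,\; T^{\top}\mathbf{1} = q \right\}
$$

The first term is linear in the transport plan $T$ and rewards feature similarity; the second is quadratic in $T$ and rewards structural consistency, which is why the abstract describes the problem as a costly quadratic one that needs approximation.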
87. 【2603.11617】Noise-aware few-shot learning through bi-directional multi-view prompt alignment
Link: https://arxiv.org/abs/2603.11617
Authors: Lu Niu, Cheng Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-language models offer, models offer strong, offer strong few-shot, strong few-shot capability, degrade cross-modal alignment
Comments:
Abstract:Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.
88. 【2603.11616】SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation
Link: https://arxiv.org/abs/2603.11616
Authors: Muyi Sun, Yifan Gao, Ziang Jia, Xingqun Qi, Qianli Zhang, Qian Liu, Tianzheng Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Cone-Beam Computed Tomography, artificial intelligence, increasingly promising, Computed Tomography, rapid advancement
Comments: 5 pages, 5 figures. Accepted to IEEE ICASSP 2026
Abstract:With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the difficulty of obtaining fully annotated data and from the acquisition variability of multi-source data across institutions, which cause low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data represents a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, a Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different levels of annotation. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data from different sources, each supervised by its respective teacher. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve the multi-source [...]. Experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance on the semi-supervised and multi-source tooth segmentation scenario.
89. 【2603.11607】DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling
Link: https://arxiv.org/abs/2603.11607
Authors: Tong Zhao, Mingkun Lei, Liangyu Yuan, Yanming Yang, Chenxi Song, Yang Wang, Beier Zhu, Chi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: process remains prohibitively, remains prohibitively slow, prohibitively slow due, sampling process remains, generative performance
Comments: Code Link: see [this https URL](https://github.com/Westlake-AGI-Lab/DyWeight)
Abstract:Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at this https URL
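A small sketch may help show what "aggregating historical gradients" and "scaling the effective step size" mean for a multi-step solver. The update below is the generic linear multi-step form; the two-step Adams-Bashforth coefficients (3/2, -1/2) are the classical handcrafted choice, while the second weight pair is a made-up placeholder standing in for learned values. Nothing here is taken from the DyWeight code.

```python
def multistep_update(x, grads, weights, h):
    """One linear multi-step update: x_next = x + h * sum_i weights[i] * grads[i].

    grads[0] is the newest gradient, grads[1] the previous one, and so on.
    """
    return x + h * sum(w * g for w, g in zip(weights, grads))

h = 0.1
grads = [2.0, 2.0]  # toy case: the gradient is constant over the last two steps

# Classical two-step Adams-Bashforth weights sum to 1, so the step length is h.
ab2 = multistep_update(0.0, grads, (1.5, -0.5), h)

# Unconstrained weights (placeholder values) sum to 1.2, so for a constant
# gradient the same update behaves like a step of length 1.2 * h.
free = multistep_update(0.0, grads, (1.8, -0.6), h)

print(ab2, free)
```

When the weights are no longer required to sum to one, their sum acts as an implicit multiplier on the step size, which is one way to read the abstract's "implicit time calibration".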
90. 【2603.11606】Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints
Link: https://arxiv.org/abs/2603.11606
Authors: Lijun Guo, Haoyu Zhao, Xingyue Zhao, Rong Fu, Linghao Zhuang, Siteng Huang, Zhongyu Li, Hua Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual data remains, Building high-fidelity digital, central challenge, visual data, data remains
Comments: 26 pages, 12 figures
Abstract:Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at this https URL.
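A kinematic primitive parameterized by a joint axis, a pivot point, and a per-frame scalar can be illustrated for the revolute (hinge) case, where the scalar is a rotation angle. The sketch below applies Rodrigues' rotation formula about an axis through the pivot. It assumes a revolute joint only (a prismatic joint would translate along the axis instead), and the function name and example values are hypothetical rather than taken from the paper.

```python
import math

def rotate_about_joint(point, axis, pivot, angle):
    """Rotate a 3D point by `angle` radians about the line through `pivot` along `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                  # unit joint axis
    v = [p - c for p, c in zip(point, pivot)]     # point relative to the pivot
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    # Rodrigues' formula: v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t))
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + c for r, c in zip(rotated, pivot)]

# A point on a door-like part, hinged on a vertical axis through (1, 0, 0).
per_frame_angles = [0.0, math.pi / 4, math.pi / 2]  # one motion scalar per frame
trajectory = [
    rotate_about_joint((2.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), a)
    for a in per_frame_angles
]
print(trajectory)
```

With the axis and pivot shared across frames, the whole motion of the part is described by one scalar per frame, which is the low-dimensional structure the abstract relies on.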
91. 【2603.11605】LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
Link: https://arxiv.org/abs/2603.11605
Authors: Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthesize temporally accurate, prevailing methods relying, methods relying heavily, joint text-motion embeddings, text-motion embeddings struggle
Comments: Accepted by CVPR 2026. Supplementary material included. Project page: [this https URL](https://jjkislele.github.io/LaMoGen/)
Abstract:Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.
92. 【2603.11593】WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
Link: https://arxiv.org/abs/2603.11593
Authors: Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: preserving non-target regions, Instruction-based image editing, Instruction-based image, modify specific content, image editing aims
Comments:
Abstract:Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
93. 【2603.11566】R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
Link: https://arxiv.org/abs/2603.11566
Authors: Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: radar-camera sensing configuration, gained increasing importance, radar-camera sensing, autonomous driving, sensing configuration
Comments: Accepted to CVPR 2026
Abstract:The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data face several challenges. First, their absolute depth estimation modules are neither robust nor accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion modules degrades dramatically, or fails outright, when the ego vehicle's pose is missing or inaccurate. Third, because radar point clouds are sparse, some small objects may return no radar points at all. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.
94. 【2603.11563】SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
Link: https://arxiv.org/abs/2603.11563
Authors: Yuyuan Yang,Junkun Hong,Hongrong Wang,Honghao Cai,Xunpeng Ren,Ge Wang,Mingcong Lei,Shenhao Yan,Jiahao Yang,Chengsi Yao,Xi Li,Yiming Zhao,Yatong Han,Jinke Ren
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: generate action sequences, coherent over time, planning demands vision-language, visually grounded, grounded and causally
Comments:
Abstract:Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
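To make the "purely relative" point in the SVLL abstract concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's Bias-DPO objective, whose exact form the abstract does not give; the function name, argument names, and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (winning, losing) trajectory pair.

    All inputs are sequence log-probabilities. Only the difference between
    the two policy/reference log-ratios enters the loss.
    """
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Same preference gap, but in the second case the policy assigns far lower
# absolute likelihood to BOTH trajectories. The loss cannot tell them apart.
loss_likely = dpo_loss(-2.0, -5.0, ref_logp_win=-3.0, ref_logp_lose=-4.0)
loss_unlikely = dpo_loss(-12.0, -15.0, ref_logp_win=-3.0, ref_logp_lose=-4.0)
```

Both calls produce the same loss because only the gap matters, which is the behavior the abstract says Bias-DPO is designed to counteract by adding a likelihood term on ground-truth actions.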
95. 【2603.11557】TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision
Link: https://arxiv.org/abs/2603.11557
Authors: Robinson Umeike,Cuong Pham,Ryan Hausen,Thang Dao,Shane Crawford,Tanya Brown-Giammanco,Gerard Lemson,John van de Lindt,Blythe Johnston,Arik Mitschang,Trung Do
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: realistic post-disaster conditions, modern real-time object, automated street-level building, real-time object detection, damage assessment evaluating
Comments:
Abstract:We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model Data: this https URL
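The TornadoNet abstract mentions soft ordinal classification targets and a mean absolute ordinal error (MAOE) without giving formulas. The sketch below shows one common way to build such targets (exponential decay with distance from the true level) and the usual definition of a mean absolute error over ordinal levels. The decay form, the `temperature` parameter, and the function names are assumptions, not the paper's definitions.

```python
import math


def soft_ordinal_targets(true_level, num_levels=5, temperature=1.0):
    """Spread label mass over neighbouring damage levels.

    Levels closer to the true level receive more mass, so confusing
    level 1 with level 2 is penalized less than confusing level 1 with level 4.
    """
    weights = [math.exp(-abs(k - true_level) / temperature) for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_absolute_ordinal_error(predicted_levels, true_levels):
    """Average distance, in levels, between predicted and true severity."""
    if not predicted_levels or len(predicted_levels) != len(true_levels):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(p - t) for p, t in zip(predicted_levels, true_levels)]
    return sum(diffs) / len(diffs)


targets = soft_ordinal_targets(true_level=2)
maoe = mean_absolute_ordinal_error([0, 2, 4], [0, 1, 2])
```

A lower `temperature` pushes the targets toward one-hot labels; a higher one treats adjacent severity levels as nearly interchangeable.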
96. 【2603.11556】Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception
Link: https://arxiv.org/abs/2603.11556
Authors: Xinyu Nan,Ning Wang,Yuyao Zhai,Mei Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Image aesthetic enhancement, perceive aesthetic deficiencies, aesthetic, Image aesthetic, aesthetic perception capabilities
Comments:
Abstract:Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.
97. 【2603.11554】MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Link: https://arxiv.org/abs/2603.11554
Authors: Lirong Che,Shuo Wen,Shan Huang,Chuang Wang,Yuzhe Yang,Gregory Dudek,Xueqian Wang,Jian Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: span multiple floors, demanding rich spatial, Real-world robotic tasks, multiple floors, demanding rich
Comments:
Abstract:Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.
98. 【2603.11551】Shadowless Projection Mapping for Tabletop Workspaces with Synthetic Aperture Projector
Link: https://arxiv.org/abs/2603.11551
Authors: Takahiro Okamoto,Masaki Takeuchi,Masataka Sawayama,Daisuke Iwai
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: supports multi-user interaction, wear head-mounted displays, enables augmented reality, multi-user interaction, wear head-mounted
Comments:
Abstract:Projection mapping (PM) enables augmented reality (AR) experiences without requiring users to wear head-mounted displays and supports multi-user interaction. It is regarded as a promising technology for a variety of applications in which users interact with content superimposed onto augmented objects in tabletop workspaces, including remote collaboration, healthcare, industrial design, urban planning, artwork creation, and office work. However, conventional PM systems often suffer from projection shadows when users occlude the light path. Prior approaches employing multiple distributed projectors can compensate for occlusion, but suffer from latency due to computational processing, degrading the user experience. In this research, we introduce a synthetic-aperture PM system that uses a significantly larger number of projectors, arranged densely in the environment, to achieve delay-free, shadowless projection for tabletop workspaces without requiring computational compensation. To address spatial resolution degradation caused by subpixel misalignment among overlaid projections, we develop and validate an offline blur compensation method whose computation time remains independent of the number of projectors. Furthermore, we demonstrate that our shadowless PM plays a critical role in achieving a fundamental goal of PM: altering material properties without evoking a projection-like impression. Specifically, we define this perceptual impression as "sense of projection (SoP)" and establish a PM design framework to minimize the SoP based on user studies.
99. 【2603.11550】PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation
Link: https://arxiv.org/abs/2603.11550
Authors: Xiangyu Li,Chenglin Wang,Qiantong Shen,Fanding Li,Wei Wang,Kuanquan Wang,Yi Shen,Baochun Zhao,Gongning Luo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Ambiguous Medical Image, Ambiguous Medical, subjective annotations, Medical Image Segmentation, significant to address
Comments:
Abstract:Ambiguous Medical Image Segmentation (AMIS) is important for addressing the challenges of inherent uncertainties from image ambiguities, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations including redundancy in high-dimensional latent spaces and limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (PEP U-Net). Our method effectively incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.
100. 【2603.11543】Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
Link: https://arxiv.org/abs/2603.11543
Authors: Tingxuan Huang,Haowei Zhu,Jun-hai Yong,Hao Pan,Bin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reconstructing dynamic, strong temporal coherence, temporal coherence remains, significant challenge, photorealistic detail
Comments:
Abstract:Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.
101. 【2603.11542】ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation
Link: https://arxiv.org/abs/2603.11542
Authors: Md Jahidul Islam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large-scale Vision-Language Models, extremely limited data, Vision-Language Models, limited data, large-scale Vision-Language
Comments:
Abstract:The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data -- specifically in the one-shot regime -- is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at this https URL.
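The ReHARK abstract describes cache-based methods as local Nadaraya-Watson estimators and proposes an ensemble of RBF kernels at several scales. As a rough illustration of those two ideas only (not the ReHARK pipeline, which also includes prior construction, bridging, and rectification), here is a Nadaraya-Watson classifier whose kernel averages RBFs over a few bandwidths. The bandwidth values and all names are assumptions.

```python
import math


def multiscale_rbf(a, b, bandwidths=(0.5, 1.0, 2.0)):
    """Average of RBF kernels exp(-||a-b||^2 / (2*s^2)) over several bandwidths s."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths) / len(bandwidths)


def nadaraya_watson_predict(query, support_features, support_labels, num_classes,
                            bandwidths=(0.5, 1.0, 2.0)):
    """Kernel-weighted average of support labels; returns one score per class."""
    weights = [multiscale_rbf(query, f, bandwidths) for f in support_features]
    total = sum(weights)
    if total == 0.0:
        # Query is too far from every support sample; fall back to uniform scores.
        return [1.0 / num_classes] * num_classes
    scores = [0.0] * num_classes
    for weight, label in zip(weights, support_labels):
        scores[label] += weight / total
    return scores


# One-shot setting: a single support feature per class.
scores = nadaraya_watson_predict(
    query=[0.2, 0.1],
    support_features=[[0.0, 0.0], [3.0, 3.0]],
    support_labels=[0, 1],
    num_classes=2,
)
```

Because the prediction is a purely local weighted average, it has no global regularizer, which is the limitation the abstract attributes to this family of estimators.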
102. 【2603.11534】Risk-Controllable Multi-View Diffusion for Driving Scenario Generation
Link: https://arxiv.org/abs/2603.11534
Authors: Hongyi Lin,Wenxiu Shi,Heye Huang,Dingyi Zhuang,Song Zhang,Yang Liu,Xiaobo Qu,Jinhua Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating safety-critical driving, autonomous driving systems, manual scenario design, Generating safety-critical, long-tail risky situations
Comments:
Abstract:Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.
103. 【2603.11531】Mobile-GS: Real-time Gaussian Splatting for Mobile Devices
Link: https://arxiv.org/abs/2603.11531
Authors: Xiaobiao Du,Yida Wang,Kun Zhan,Xin Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, large storage costs, storage costs pose, costs pose significant, pose significant challenges
Comments: Project Page: https://xiaobiaodu.github.io/mobile-gs-project/
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications. However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the lack of an explicit rendering order. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS can achieve both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks. Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.
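To illustrate why alpha blending needs depth sorting and what "order-independent" buys, the sketch below contrasts classic front-to-back compositing with a generic depth-weighted blend. The weighted blend is a textbook-style stand-in, not the scheme proposed in Mobile-GS, and the exponential depth weight and `depth_scale` are assumptions. Colors are single grayscale values for brevity.

```python
import math


def alpha_blend_in_order(splats):
    """Composite (depth, alpha, color) splats in the order given.

    This is only correct when the list is already sorted front to back,
    which is why standard 3DGS rendering has to sort by depth first.
    """
    color, transmittance = 0.0, 1.0
    for _depth, alpha, splat_color in splats:
        color += transmittance * alpha * splat_color
        transmittance *= 1.0 - alpha
    return color


def order_independent_blend(splats, depth_scale=1.0):
    """Depth-weighted average: nearer splats get larger weights, no sorting needed."""
    numerator = denominator = 0.0
    for depth, alpha, splat_color in splats:
        weight = alpha * math.exp(-depth / depth_scale)
        numerator += weight * splat_color
        denominator += weight
    return numerator / denominator if denominator > 0.0 else 0.0


near_white = (1.0, 0.5, 1.0)   # depth 1, half transparent, white
far_black = (2.0, 0.5, 0.0)    # depth 2, half transparent, black

sorted_result = alpha_blend_in_order([near_white, far_black])
unsorted_result = alpha_blend_in_order([far_black, near_white])
oit_a = order_independent_blend([near_white, far_black])
oit_b = order_independent_blend([far_black, near_white])
```

The ordered compositor gives different pixels for the two input orders, while the weighted blend does not. The price is that the weighted blend only approximates true occlusion, which matches the transparency artifacts the abstract mentions.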
104. 【2603.11525】MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Link: https://arxiv.org/abs/2603.11525
Authors: Jian Zou,Xiaoyu Xu,Zhihua Wang,Yilin Wang,Balu Adsumilli,Kede Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Learning-based video quality, video quality assessment, quality assessment, advanced rapidly, Learning-based video
Comments:
Abstract:Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
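The MDS-VQA abstract describes a greedy procedure that trades off predicted difficulty against content diversity under a labeling budget. A minimal sketch of that kind of selection follows. The additive score, the nearest-selected-neighbor diversity term, and `diversity_weight` are assumptions made for illustration; the abstract does not specify the paper's exact objective.

```python
import math


def greedy_select(difficulty, features, budget, diversity_weight=1.0):
    """Pick up to `budget` items that are both hard and far from items already picked.

    difficulty: one predicted-failure score per candidate video.
    features:   one semantic feature vector per candidate video.
    """
    selected = []
    remaining = set(range(len(difficulty)))

    def gain(index):
        if not selected:
            return difficulty[index]
        nearest = min(math.dist(features[index], features[j]) for j in selected)
        return difficulty[index] + diversity_weight * nearest

    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


difficulty = [0.90, 0.85, 0.20]
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]

with_diversity = greedy_select(difficulty, features, budget=2)
difficulty_only = greedy_select(difficulty, features, budget=2, diversity_weight=0.0)
```

With the diversity term switched off, the two near-duplicate hard videos are both chosen; with it on, the second pick moves to the distant video even though its difficulty score is lower.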
105. 【2603.11521】EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
Link: https://arxiv.org/abs/2603.11521
Authors: Shuo Jiang,Gaojia Zhang,Min Tan,Yufei Yin,Gang Pan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Unsupervised Camouflaged Object, Camouflaged Object Detection, Unsupervised Camouflaged, challenging task due, high intrinsic similarity
Comments: Accepted by CVPR 2026
Abstract:Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
106. 【2603.11520】FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval
Link: https://arxiv.org/abs/2603.11520
Authors: Chenchen Zhao,Jianhuan Zhuo,Muxi Chen,Zhaohua Zhang,Wenyu Jiang,Tianwen Jiang,Qiuyong Xiao,Jihong Zhang,Qiang Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: semantic modifications presented, Composed image retrieval, Composed image, text-image input pairs, CIR
Comments: 20 pages, 5 figures, 15 tables
Abstract:Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.
107. 【2603.11519】Prediction of Grade, Gender, and Academic Performance of Children and Teenagers from Handwriting Using the Sigma-Lognormal Model
Link: https://arxiv.org/abs/2603.11519
Authors: Adrian Iste,Kazuki Nishizawa,Chisa Tanaka,Andrew Vargo,Anna Scius-Bertrand,Andreas Fischer,Koichi Kise
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Digital handwriting acquisition, underlying writing behavior, processes underlying writing, motor processes underlying, Digital handwriting
Comments: 18 pages, 8 figures
Abstract:Digital handwriting acquisition enables the capture of detailed temporal and kinematic signals reflecting the motor processes underlying writing behavior. While handwriting analysis has been extensively explored in clinical or adult populations, its potential for studying developmental and educational characteristics in children remains less investigated. In this work, we examine whether handwriting dynamics encode information related to student characteristics using a large-scale online dataset collected from Japanese students from elementary school to junior high school. We systematically compare three families of handwriting-derived features: basic statistical descriptors of kinematic signals, entropy-based measures of variability, and parameters obtained from the sigma-lognormal model. Although the dataset contains dense stroke-level recordings, features are aggregated at the student level to enable a controlled comparison between representations. These features are evaluated across three prediction tasks: grade prediction, gender classification, and academic performance classification, using Linear or Logistic Regression and Random Forest models under consistent experimental settings. The results show that handwriting dynamics contain measurable signals related to developmental stage and individual differences, especially for the grade prediction task. These findings highlight the potential of kinematic handwriting analysis and confirm that through their development, children's handwriting evolves toward a lognormal motor organization.
108. 【2603.11512】From Pen Strokes to Sleep States: Detecting Low-Recovery Days Using Sigma-Lognormal Handwriting Features
Link: https://arxiv.org/abs/2603.11512
Authors: Chisa Tanaka,Andrew Vargo,Anna Scius-Bertrand,Andreas Fischer,Koichi Kise
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: individuals remains unexplored, healthy individuals remains, potential to reflect, remains unexplored, traditionally been studied
Comments: 16 pages, 7 figures
Abstract:While handwriting has traditionally been studied for character recognition and disease classification, its potential to reflect day-to-day physiological fluctuations in healthy individuals remains unexplored. This study examines whether daily variations in sleep-related recovery states can be inferred from online handwriting dynamics. We propose a personalized binary classification framework that detects low-recovery days using features derived from the Sigma-Lognormal model, which captures the neuromotor generation process of pen strokes. In a 28-day in-the-wild study involving 13 university students, handwriting was recorded three times daily, and nocturnal cardiac indicators were measured using a wearable ring. For each participant, the lowest (or highest) quartile of four sleep-related metrics -- HRV, lowest heart rate, average heart rate, and total sleep duration -- defined the positive class. Leave-One-Day-Out cross-validation showed that PR-AUC significantly exceeded the baseline (0.25) for all four variables after FDR correction, with the strongest performance observed for cardiac-related variables. Importantly, classification performance did not differ significantly across task types or recording timings, indicating that recovery-related signals are embedded in general movement dynamics. These results demonstrate that subtle within-person autonomic recovery fluctuations can be detected from everyday handwriting, opening a new direction for non-invasive, device-independent health monitoring.
109. 【2603.11509】Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance
Link: https://arxiv.org/abs/2603.11509
Authors: Zexi Jia,Pengcheng Luo,Zhengyao Fang,Jinchao Zhang,Jie Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: notoriously induce oversaturation, scales notoriously induce, facto control mechanism, high guidance scales, guidance scales notoriously
Comments:
Abstract:Classifier-Free Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.
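For context on the "Euclidean extrapolation" the MOG abstract criticizes, this is the standard CFG combination rule written out on plain lists. It shows only the baseline being analyzed; the MOG Riemannian update is not reproduced here because the abstract does not give its form. Function and argument names are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: start at the unconditional prediction and move along
    the straight line toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_guidance = classifier_free_guidance(eps_uncond, eps_cond, scale=1.0)
strong_guidance = classifier_free_guidance(eps_uncond, eps_cond, scale=7.5)
```

At `scale=1.0` the result is just the conditional prediction. At larger scales the output lies outside the segment between the two predictions, a linear extrapolation in ambient space that, per the abstract, can push samples off the data manifold.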
110. 【2603.11505】Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices
Link: https://arxiv.org/abs/2603.11505
Authors: Rambod Azimi,Yuri Grinberg,Dan-Xia Xu,Odile Liboiron-Ladouceur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: exhibit fabrication-induced variations, alter device performance, Silicon photonic devices, significantly alter device, Silicon photonic
Comments: Accepted and published in Structural and Multidisciplinary Optimization (2026)
Abstract:Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, underetching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and varying U-Nets (85.8%). It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.
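The Gen-Fab results are reported as intersection-over-union (IoU) between predicted and fabricated shapes. For readers who want the metric spelled out, here is a small sketch on binary masks stored as nested lists. How the paper binarizes SEM images or aggregates over samples is not stated in the abstract, so none of that is modeled; the empty-mask convention below is an assumption.

```python
def binary_iou(pred_mask, true_mask):
    """IoU of two equally sized binary masks (rows of 0/1 values)."""
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred_pixel, true_pixel in zip(pred_row, true_row):
            if pred_pixel and true_pixel:
                intersection += 1
            if pred_pixel or true_pixel:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted = [[1, 1],
             [0, 0]]
fabricated = [[1, 0],
              [1, 0]]
iou = binary_iou(predicted, fabricated)
```

Here one pixel overlaps out of three covered by either mask, so the score is one third.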
111. 【2603.11498】ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation
Link: https://arxiv.org/abs/2603.11498
Authors: Lijun Guo, Qian Zhou, Zidi Shi, Hua Zou, Gang Ke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: typically involving iterative, involving iterative user, iterative user input, obtain precise, typically involving
Comments: 16 pages, 8 figures, published in Knowledge-Based Systems
Abstract:Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and to achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively. Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.
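NoC@90 is commonly read as the average number of clicks needed to reach 90% IoU. The sketch below follows that common reading. The cap of 20 clicks for samples that never reach the threshold is an assumed convention rather than something stated in the abstract, and the names `number_of_clicks` and `mean_noc` are hypothetical.

```python
def number_of_clicks(iou_per_click, threshold=0.90, max_clicks=20):
    """Return the 1-based index of the first click whose IoU meets the threshold.

    iou_per_click[i] is the IoU after click i+1. If the threshold is never
    reached within max_clicks, the sample is charged max_clicks (assumed cap).
    """
    for click, value in enumerate(iou_per_click[:max_clicks], start=1):
        if value >= threshold:
            return click
    return max_clicks


def mean_noc(runs, threshold=0.90, max_clicks=20):
    """Average the per-sample click counts over a dataset of click histories."""
    counts = [number_of_clicks(run, threshold, max_clicks) for run in runs]
    return sum(counts) / len(counts)


# Three toy samples: solved after 1 click, after 2 clicks, and never solved.
histories = [[0.95], [0.50, 0.91], [0.20, 0.30, 0.40]]
average_clicks = mean_noc(histories)
```

Lower is better under this metric, which is why the reported reductions are described as improvements.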
112. 【2603.11493】OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure
Link: https://arxiv.org/abs/2603.11493
Authors: Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: suppressing selected neurons, face significant safety, significant safety risks, models face significant, adversarial induction
Comments:
Abstract:Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
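One plausible reading of "projecting erasure vectors onto the null space of the coupled neurons" is removing from the erasure direction every component that lies in the span of the coupled (benign) directions. The sketch below illustrates only that linear-algebra step with Gram-Schmidt in plain Python. It is not the authors' implementation, and all names (`project_out`, `orthonormal_basis`) are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(erase_vec, coupled_vecs):
    """Remove from erase_vec its components along the coupled directions."""
    out = [float(x) for x in erase_vec]
    for b in orthonormal_basis(coupled_vecs):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Toy example: the erasure direction leaks into one benign direction (axis 1).
erase = [1.0, 1.0, 0.0]
coupled = [[0.0, 1.0, 0.0]]
safe_erase = project_out(erase, coupled)
```

After projection the edited direction has zero inner product with every coupled direction, so a first-order edit along it leaves those activations unchanged.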
113. 【2603.11492】SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
Link: https://arxiv.org/abs/2603.11492
Authors: Xiaogang Du, Jiawei Zhang, Tongfei Liu, Tao Lei, Yingbo Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: domain gap caused, medical image segmentation, image segmentation tasks, data collection, testing data
Comments: Accepted to CVPR 2026. 16 pages, 7 figures
Abstract:In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at the cluster level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at this https URL.
114. 【2603.11481】INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
Link: https://arxiv.org/abs/2603.11481
Authors: Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Video Large Language, Large Language, remain unreliable due, verifiable world knowledge
Comments:
Abstract:Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.
115. 【2603.11460】Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
Link: https://arxiv.org/abs/2603.11460
Authors: Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dense Video Captioning, Existing retrieval-augmented approaches, Video Captioning, Dense Video, true event boundaries
Comments: CVPR 2026 accepted paper (main track)
Abstract:Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose using the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at this https URL.
116. 【2603.11442】GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
Link: https://arxiv.org/abs/2603.11442
Authors: Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng, Alex Shen, Jiayue Xu, Yuxin Zhang, Evelyn Marotta
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: detect AI-generated financial, AI-generated financial documents, financial documents, Abstract, Claude Sonnet
Comments: 12 pages, 7 figures, 7 tables
Abstract:Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human--LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
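The arithmetic consistency check that the abstract credits to LLMs can be made concrete with a small sketch. It assumes a receipt has already been parsed into line items, a subtotal, a tax amount, and a total. The field layout and the function name `receipt_is_consistent` are hypothetical and do not come from the dataset.

```python
from decimal import Decimal


def receipt_is_consistent(items, subtotal, tax, total):
    """Check that line items sum to the subtotal and subtotal + tax equals total.

    items is a list of (quantity, unit_price) pairs. Prices are passed as
    strings and handled with Decimal to avoid binary floating-point rounding.
    """
    computed_subtotal = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items),
        Decimal("0"),
    )
    subtotal_ok = computed_subtotal == Decimal(subtotal)
    total_ok = Decimal(subtotal) + Decimal(tax) == Decimal(total)
    return subtotal_ok and total_ok


# Toy receipt: two items at 1.50 and one item at 4.25.
items = [(2, "1.50"), (1, "4.25")]
genuine = receipt_is_consistent(items, "7.25", "0.58", "7.83")
tampered = receipt_is_consistent(items, "7.35", "0.58", "7.93")
```

A check like this is invisible to purely visual inspection, which matches the abstract's explanation of why human annotators can spot visual artifacts yet still miss generated receipts.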
117. 【2603.11441】Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection
Link: https://arxiv.org/abs/2603.11441
Authors: Mehmet Kerem Turkcan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: produced promptable detection, accept arbitrary natural, arbitrary natural language, natural language queries, Recent advances
Comments:
Abstract:Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at this https URL.
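The backbone-sharing idea (one class-agnostic backbone pass reused by every class prompt) can be sketched with toy stand-ins. Nothing below is SAM3 or DART code: `CountingBackbone`, `decode`, and both detection functions are hypothetical placeholders that only make the call counts visible.

```python
class CountingBackbone:
    """Toy stand-in for an expensive, prompt-independent visual backbone."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return sum(image) / len(image)  # placeholder "feature"


def decode(features, prompt):
    """Toy stand-in for a cheap, prompt-conditioned decoder."""
    return (prompt, round(features * len(prompt), 3))


def detect_naive(backbone, image, prompts):
    """One full forward pass per prompt: backbone cost grows as O(N)."""
    return [decode(backbone(image), prompt) for prompt in prompts]


def detect_shared(backbone, image, prompts):
    """Run the backbone once and reuse its features: backbone cost is O(1)."""
    features = backbone(image)
    return [decode(features, prompt) for prompt in prompts]


image = [0.2, 0.4, 0.6]
prompts = ["car", "person", "dog"]

naive_backbone = CountingBackbone()
naive_out = detect_naive(naive_backbone, image, prompts)

shared_backbone = CountingBackbone()
shared_out = detect_shared(shared_backbone, image, prompts)
```

Both paths give identical outputs here because the toy backbone ignores the prompt, which mirrors the structural invariant the abstract relies on. The decoder cost still scales with the number of classes, which the abstract addresses separately through batched multi-class decoding.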
118. 【2603.11439】Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
Link: https://arxiv.org/abs/2603.11439
Authors: Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, Jae Won Cho
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dense Video Captioning, Dense Video, involves temporally localizing, temporally localizing multiple, challenging multimodal task
Comments: Accepted to CVPR 2026
Abstract:Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
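The abstract does not give the form of the overlap suppression loss. As an illustration only, the sketch below penalizes the mean pairwise temporal IoU among predicted segments, which is one simple way to express "mutual temporal overlaps across queries are penalized". The function names are hypothetical, and a trainable version would need a differentiable formulation in a tensor library.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) with start <= end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def overlap_penalty(segments):
    """Mean pairwise temporal IoU; 0.0 when no two segments overlap."""
    pairs = [
        (i, j)
        for i in range(len(segments))
        for j in range(i + 1, len(segments))
    ]
    if not pairs:
        return 0.0
    total = sum(temporal_iou(segments[i], segments[j]) for i, j in pairs)
    return total / len(pairs)


# Toy predictions in seconds: disjoint, partially overlapping, and duplicated.
disjoint = overlap_penalty([(0.0, 1.0), (2.0, 3.0)])
partial = overlap_penalty([(0.0, 2.0), (1.0, 3.0)])
duplicate = overlap_penalty([(0.0, 2.0), (0.0, 2.0)])
```

Minimizing such a term pushes queries toward distinct event regions, at the cost of discouraging events that genuinely overlap in time.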
119. 【2603.11423】Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
Link: https://arxiv.org/abs/2603.11423
Authors: Songlin Li, Xin Zhu, Zechao Guan, Peipeng Chen, Jian Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, Traditional black-box distillation, Traditional black-box, Large Vision-Language, yields high-variance responses
Comments:
Abstract:Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
120. 【2603.11421】ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
Link: https://arxiv.org/abs/2603.11421
Authors: Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong, Taiyi Wu, Xiaotong Zhao, Ran Zhang, Alan Zhao, Anyi Rao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: democratized film creation, Text-driven video generation, multi-shot scenarios remains, Text-driven video, film creation
Comments:
Abstract:Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
121. 【2603.11417】Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations
Link: https://arxiv.org/abs/2603.11417
Authors: Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remains largely unexamined, unseen cities remains, cities remains largely, largely unexamined, typically trained
Comments:
Abstract:End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.
122. 【2603.11410】Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary
Link: https://arxiv.org/abs/2603.11410
Authors: Nazia Tasnim, Keanu Nichols, Yuting Yang, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Humans learn object, mentally rotating, orientation, object orientation progressively, Orientation Reasoning Intelligence
Comments:
Abstract:Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.
123. 【2603.11404】Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization
Link: https://arxiv.org/abs/2603.11404
Authors: Hanyang Hu, Zekai Liang, Florian Richter, Michael C. Yip
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Robot-Assisted Minimally Invasive, Minimally Invasive Surgery, Minimally Invasive, Accurate and efficient, Robot-Assisted Minimally
Comments:
Abstract:Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots still remains challenging due to partial visibility and specialized articulation design of surgical instruments. Previous works in the field are usually prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational costs and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. Using batch rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
124. 【2603.11403】DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification
Link: https://arxiv.org/abs/2603.11403
Authors: Ravi Mosalpuri, Mohammed Abdelsamea, Ahmed Karam Eldaly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: detailed cellular-level assessment, tissue morphology, remains the gold, gold standard, detailed cellular-level
Comments:
Abstract:Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.
125. 【2603.11396】Harnessing Data Asymmetry: Manifold Learning in the Finsler World
Link: https://arxiv.org/abs/2603.11396
Authors: Thomas Dagès, Simon Weber, Daniel Cremers, Ron Kimmel
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: analysis and visualisation, fundamental task, Riemannian geometry, Finsler, Finsler manifold learning
Comments:
Abstract:Manifold learning is a fundamental task at the core of data analysis and visualisation. It aims to capture the simple underlying structure of complex high-dimensional data by preserving pairwise dissimilarities in low-dimensional embeddings. Traditional methods rely on symmetric Riemannian geometry, thus forcing symmetric dissimilarities and embedding spaces, e.g. Euclidean. However, in practice this discards valuable asymmetric information inherent to the non-uniformity of data samples. We suggest harnessing this asymmetry by switching to Finsler geometry, an asymmetric generalisation of Riemannian geometry, and propose a Finsler manifold learning pipeline that constructs asymmetric dissimilarities and embeds in a Finsler space. This greatly broadens the applicability of existing asymmetric embedders beyond traditionally directed data to any data. We also modernise asymmetric embedders by generalising current reference methods to asymmetry, like Finsler t-SNE and Finsler UMAP. On controlled synthetic and large real datasets, we show that our asymmetric pipeline reveals valuable information lost in the traditional pipeline, e.g. density hierarchies, and consistently provides higher-quality embeddings than its Euclidean counterparts.
126. 【2603.11389】High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping
Link: https://arxiv.org/abs/2603.11389
Authors: Sehoon Tak, Keunhee Cho, Sangpil Kim, Jae-Sang Hyun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Digital fringe projection, Digital fringe, large-scale mapping remains, mapping remains challenging, enables micrometer-level
Comments:
Abstract:Digital fringe projection (DFP) enables micrometer-level 3D reconstruction, yet extending it to large-scale mapping remains challenging because six-degree-of-freedom pose estimation often cannot match the reconstruction's precision. Conventional iterative closest point (ICP) registration becomes inefficient on multi-million-point clouds and typically relies on downsampling or feature-based selection, which can reduce local detail and degrade pose precision. Drift-correction methods improve long-term consistency but do not resolve sampling sensitivity in dense DFP point clouds. We propose a high-precision pose estimation method that augments a moving DFP system with a fixed, intrinsically calibrated global projector. Using the global projector's phase-derived pixel constraints and a PnP-style reprojection objective, the method estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and we experimentally demonstrate sampling invariance under coordinate-preserving subsampling. Experiments demonstrate sub-millimeter pose accuracy against a reference with quantified uncertainty bounds, high repeatability under aggressive subsampling, robust operation on homogeneous surfaces and low-overlap views, and reduced error accumulation when used to correct ICP-based trajectories. The method extends DFP toward accurate 3D mapping in quasi-static scenarios such as inspection and metrology, with the trade-off of time-multiplexed acquisition for the additional projector measurements.
127. 【2603.11380】DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding
Link: https://arxiv.org/abs/2603.11380
Authors: Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Fusing sensors, Language Models, crucial for maintaining
Comments:
Abstract:Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.
128. 【2603.11346】Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2603.11346
Authors: Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa, Katerina Fragkiadaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Keywords: transform daily service, caregiving applications, robotics has strong, strong potential, potential to transform
Comments: Accepted at CVPR 2026 (main). Project page: [this https URL](https://yutoshibata07.github.io/AssistMimic-projectpage/)
Abstract:Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that our method, AssistMimic, is the first capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.
129. 【2603.11325】Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis
Link: https://arxiv.org/abs/2603.11325
Authors: Zhenxuan Zhang, Peiyuan Jing, Ruicheng Yuan, Liwei Hu, Anbang Wang, Fanwen Wang, Yinzhe Wu, Kh Tohidul Islam, Zhaolin Chen, Zi Wang, Peter Lally, Guang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Low-field to high-field, access to high-field, high-field scanners, acquisition constraints, limited or impractical
Comments:
Abstract:Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
130. 【2603.11323】UNet-AF: An alias-free UNet for image restoration
Link: https://arxiv.org/abs/2603.11323
Authors: Jérémy Scanvic, Quentin Barthélemy, Julián Tachella
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diffusion models, simplicity and effectiveness, makes it ubiquitous, UNet architecture makes, image segmentation
Comments:
Abstract:The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at this https URL
131. 【2603.11320】UniCompress: Token Compression for Unified Vision-Language Understanding and Generation
Link: https://arxiv.org/abs/2603.11320
Authors: Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single autoregressive framework, Unified models aim, aim to support, processing them alongside, alongside text
Comments:
Abstract:Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose UniCompress, a unified token compression algorithm that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real-world multimodal applications.
132. 【2603.11306】Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild
Link: https://arxiv.org/abs/2603.11306
Authors: Jun Yu, Yunxiang Zhang, Naixiang Zheng, Lingsi Zhu, Guoyuan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Facial Action Unit, Action Unit, severe spatial-temporal heterogeneity, complex audio-visual dependencies, formidable challenge due
Comments: 8 pages, 1 figure
Abstract:Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Modeling. We leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual [...]. Experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.
133. 【2603.11298】InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction
Link: https://arxiv.org/abs/2603.11298
Authors: Dingqiang Ye, Jiacong Xu, Jianglu Ping, Yuxiang Guo, Chao Fan, Vishal M. Patel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High dynamic range, low dynamic range, dynamic range, High dynamic, multi-exposure low dynamic
Comments:
Abstract:High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes from multi-exposure low dynamic range (LDR) images. Existing HDR pipelines heavily rely on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization. Current feed-forward alternatives overlook the HDR problem by assuming exposure-invariant appearance. To bridge this gap, we propose InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR collections in a single forward pass. Specifically, we design geometry-guided appearance modeling for multi-exposure fusion, and a meta-network for generalizable scene-specific tone mapping. Due to the lack of HDR scene data, we build a pre-training dataset, called HDR-Pretrain, for generalizable feed-forward HDR models, featuring 168 Blender-rendered scenes, diverse lighting types, and multiple camera response functions. Comprehensive experiments show that InstantHDR delivers synthesis performance comparable to state-of-the-art optimization-based HDR methods while achieving $\sim700\times$ and $\sim20\times$ faster reconstruction with our single-forward and post-optimization settings, respectively. All code, models, and datasets will be released after the review process.
134. 【2603.11257】Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery
Link: https://arxiv.org/abs/2603.11257
Authors: Yu Chung Lee, David G. Black, Ryan S. Yeung, Septimiu E. Salcudean
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: adjusting probe position, Cardiac and lung, intercostal acoustic windows, lung ultrasound, ultrasound are technically
Comments: 10 pages, 6 figures. Under review
Abstract:Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound when a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the placement error quantitatively. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within the anatomical variability acceptable for teleultrasound setup.
135. 【2603.11252】Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models
Link: https://arxiv.org/abs/2603.11252
Authors: Benedikt Schwab, Thomas H. Kolbe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains largely untapped, information remains largely, material information remains, increasingly detailed, largely untapped
Comments:
Abstract:Although semantic 3D city models are internationally available and becoming increasingly detailed, the incorporation of material information remains largely untapped. However, a structured representation of materials and their physical properties could substantially broaden the application spectrum and analytical capabilities for urban digital twins. At the same time, the growing number of repeated mobile laser scans of cities and their street spaces yields a wealth of observations influenced by the material characteristics of the corresponding surfaces. To leverage this information, we propose radiometric fingerprints of object surfaces by grouping LiDAR observations reflected from the same semantic object under varying distances, incident angles, environmental conditions, sensors, and scanning campaigns. Our study demonstrates how 312.4 million individual beams acquired across four campaigns using five LiDAR sensors on the Audi Autonomous Driving Dataset (A2D2) vehicle can be automatically associated with 6368 individual objects of the semantic 3D city model. The model comprises a comprehensive and semantic representation of four inner-city streets at Level of Detail (LOD) 3 with centimeter-level accuracy. It is based on the CityGML 3.0 standard and enables fine-grained sub-differentiation of objects. The extracted radiometric fingerprints for object surfaces reveal recurring intra-class patterns that indicate class-dominant materials. The semantic model, the method implementations, and the developed geodatabase solution 3DSensorDB are released under: this https URL
136. 【2603.11246】When Slots Compete: Slot Merging in Object-Centric Learning
Link: https://arxiv.org/abs/2603.11246
Authors: Christos Chatzisavvas, Panagiotis Rigas, George Ioannakis, Vassilis Katsouros, Nikolaos Mitianoudis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Slot-based object-centric learning, object-centric learning represents, Slot-based object-centric, represents an image, object-centric learning
Comments:
Abstract:Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.
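The overlap test and merge step described in this abstract can be sketched in a few lines. The abstract does not spell out its exact formulas, so the code below assumes one common soft IoU definition (sum of element-wise minima over sum of element-wise maxima) and an attention-mass-weighted average as the barycentric update. The attention maps, slot vectors, and the fixed threshold of 0.5 are hypothetical; the paper infers its threshold from overlap statistics.

```python
def soft_iou(attn_a, attn_b):
    """Soft IoU between two non-negative attention maps given as flat lists."""
    inter = sum(min(a, b) for a, b in zip(attn_a, attn_b))
    union = sum(max(a, b) for a, b in zip(attn_a, attn_b))
    return inter / union if union > 0 else 0.0


def barycentric_merge(slot_a, slot_b, mass_a, mass_b):
    """Weighted average of two slot vectors; weights are their attention masses."""
    w = mass_a / (mass_a + mass_b)
    return [w * a + (1.0 - w) * b for a, b in zip(slot_a, slot_b)]


# Toy attention maps over four image patches (hypothetical values).
attn_a = [0.9, 0.8, 0.1, 0.0]
attn_b = [0.7, 0.9, 0.2, 0.0]  # covers nearly the same patches as attn_a
attn_c = [0.0, 0.1, 0.1, 0.9]  # attends elsewhere

iou_ab = soft_iou(attn_a, attn_b)
iou_ac = soft_iou(attn_a, attn_c)

THRESHOLD = 0.5  # assumed fixed value, for illustration only
merged = None
if iou_ab > THRESHOLD:
    merged = barycentric_merge([1.0, 0.0], [0.0, 1.0], sum(attn_a), sum(attn_b))
```

Slots `a` and `b` overlap strongly and are merged, while `a` and `c` stay separate. Since the merged slot is a convex combination of its parents, gradients would still reach both inputs in an autodiff framework.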
137. 【2603.11220】Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
Link: https://arxiv.org/abs/2603.11220
Authors: Qingtao Pan, Zhihao Dou, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models, adapt varying computational, varying computational budgets
Comments:
Abstract:Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.
138. 【2603.11219】Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning
Link: https://arxiv.org/abs/2603.11219
Authors: Yuehao Song, Shaoyu Chen, Hao Gao, Yifan Zhu, Weixiang Yue, Jialv Zou, Bo Jiang, Zihao Lu, Yu Wang, Qian Zhang, Xinggang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-level semantic reasoning, leveraging high-level semantic, Vision-language models, semantic reasoning, VLM high-level decision
Comments: 15 pages, 8 figures. Project page: [this https URL](https://ambitious-idiot.github.io/senna2-project)
Abstract:Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, weakening the system's top-down guidance and decision-following ability. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to the E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
139. 【2603.11211】A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters
Link: https://arxiv.org/abs/2603.11211
Authors: Haihua Luo, Xuming Ran, Jiangrong Shen, Timo Hämäläinen, Zhonghua Chen, Qi Xu, Fengyu Cong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: previously acquired knowledge, preserving previously acquired, aims to learn, acquired knowledge, preserving previously
Comments:
Abstract:Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).
140. 【2603.11206】Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction
Link: https://arxiv.org/abs/2603.11206
Authors: Jingxing Zhong, Qingtao Pan, Xuchang Zhou, Jiazhen Lin, Xinguo Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Breast Tumor Segmentation, Magnetic Resonance Imaging, tumor segmentation, women worldwide, fatalities annually
Comments:
Abstract:Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) can provide various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for the detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limitations in accurately locating tumor contours because of the low contrast between cancerous and normal areas and blurred boundaries. Text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each stage of down-sampling, further exploiting text prompts to help locate lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the model's segmentation uncertainty at blurred boundaries. It utilizes the variational Dirichlet to characterize the distribution of the segmentation probabilities, addressing the segmentation uncertainties of the boundaries. Extensive experiments validate the superiority of our TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.
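As a rough illustration of how a Dirichlet distribution can express per-pixel segmentation uncertainty, the sketch below uses the standard subjective-logic formulation of evidential deep learning: non-negative evidence per class, Dirichlet parameters equal to evidence plus one, and uncertainty equal to the class count divided by the Dirichlet strength. Whether TextBCS uses exactly this parameterisation is an assumption, and the evidence values are invented.

```python
def dirichlet_uncertainty(evidence):
    """Map non-negative per-class evidence to probabilities, beliefs and uncertainty."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    prob = [a / strength for a in alpha]     # expected class probabilities
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength     # beliefs plus uncertainty sum to 1
    return prob, belief, uncertainty


# Hypothetical evidence for (background, tumor) at two pixels.
prob_in, belief_in, u_in = dirichlet_uncertainty([0.0, 18.0])  # inside the lesion
prob_bd, belief_bd, u_bd = dirichlet_uncertainty([1.0, 1.0])   # on a blurred boundary
```

The interior pixel carries strong evidence for one class and therefore low uncertainty, while the boundary pixel has little evidence overall and a much higher uncertainty, even though its class probabilities are simply split evenly.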
141. 【2603.11174】GGPT: Geometry Grounded Point Transformer
Link: https://arxiv.org/abs/2603.11174
Authors: Yutong Chen, Yiming Wang, Xucong Zhang, Sergey Prokudin, Siyu Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent feed-forward networks, achieved remarkable progress, RGB images, directly from RGB, progress in sparse-view
Comments: CVPR 2026, Project website: [this https URL](https://chenyutongthu.github.io/research/ggpt)
Abstract:Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.
142. 【2603.11147】Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Link: https://arxiv.org/abs/2603.11147
Authors: Minsak Nanang, Adrian Hilton, Armin Mustafa
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: material remains effectively, remains effectively locked, growing rapidly, lacks consistent, galleries are growing
Comments:
Abstract:Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
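Step (iii), conservative similarity matching against a structured catalogue, might look roughly like the sketch below. It is only an illustration of the idea of abstaining when a match is weak or ambiguous: the word-level Jaccard similarity, the `threshold` and `margin` parameters, and the two catalogue records are all assumptions, not details taken from the paper.

```python
def jaccard(text_a, text_b):
    """Word-level Jaccard similarity between two strings."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def attribute(description, catalogue, threshold=0.75, margin=0.1):
    """Return the best catalogue record, or None when the match is weak or ambiguous."""
    scored = sorted(
        ((jaccard(description, rec["text"]), rec) for rec in catalogue),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_score, best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Conservative rule: require a high score AND a clear lead over the runner-up.
    if best_score >= threshold and best_score - runner_up >= margin:
        return best
    return None


# Hypothetical catalogue records.
catalogue = [
    {"title": "Harbour at Dusk", "artist": "A. Example",
     "text": "oil painting of fishing boats in a harbour at dusk"},
    {"title": "Winter Orchard", "artist": "B. Example",
     "text": "oil painting of bare apple trees in snow"},
]

hit = attribute("painting of fishing boats in a harbour at dusk", catalogue)
miss = attribute("abstract sculpture in bronze", catalogue)
```

Returning `None` instead of the nearest record keeps wrong titles and artists out of the archive, at the cost of leaving some videos unattributed for manual review.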
143. 【2603.11142】Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT
Link: https://arxiv.org/abs/2603.11142
Authors: Sai V R Chereddy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: tasks represent nuanced, represent nuanced, paper explores, affect the final, key challenge
Comments: Accepted at the AAAI 2026 Workshop on Deployable AI (DAI). Non-archival. Code and custom dataset available upon request
Abstract:The paper explores how video models trained for classification tasks represent nuanced, hidden semantic information that may not affect the final outcome, a key challenge for Trustworthy AI models. Through Explainable and Interpretable AI methods, specifically mechanistic interpretability techniques, the internal circuit responsible for representing the action's outcome is reverse-engineered in a pre-trained video vision transformer, revealing that the "Success vs Failure" signal is computed through a distinct amplification cascade. While there are low-level differences observed from layer 0, the abstract and semantic representation of the outcome is progressively amplified from layers 5 through 11. Causal analysis, primarily using activation patching supported by ablation results, reveals a clear division of labor: Attention Heads act as "evidence gatherers", providing necessary low-level information for partial signal recovery, while MLP Blocks function as robust "concept composers", each of which is the primary driver to generate the "success" signal. This distributed and redundant circuit in the model's internals explains its resilience to simple ablations, demonstrating a core computational pattern for processing human-action outcomes. Crucially, the existence of this sophisticated circuit for representing complex outcomes, even within a model trained only for simple classification, highlights the potential for models to develop forms of 'hidden knowledge' beyond their explicit task, underscoring the need for mechanistic oversight for building genuinely Explainable and Trustworthy AI systems intended for deployment.
144. 【2603.11106】RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
Link: https://arxiv.org/abs/2603.11106
Authors: Shijie Zhou, Bin Zhu, Jiarui Yang, Xiangyu Zhao, Jingjing Chen, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: execute increasingly complex, Recent advances, increasingly complex tasks, VLA models, execute increasingly
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
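The idea of training on positive samples only and scoring anomalies through the probability density can be shown with the simplest possible normalizing flow: a single affine layer that maps a scalar feature to a standard normal. This is a toy stand-in, not RC-NF itself, which conditions on robot and object states; the feature values and the threshold rule are assumptions for illustration.

```python
import math
import statistics


def fit_affine_flow(samples):
    """Fit z = (x - mu) / sigma on normal (positive) samples only."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu, sigma


def anomaly_score(x, mu, sigma):
    """Negative log-likelihood of x under the flow (change-of-variables formula)."""
    z = (x - mu) / sigma
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # standard normal log-density
    log_det = -math.log(sigma)                               # log |dz/dx|
    return -(log_base + log_det)


# Hypothetical scalar feature recorded during successful executions.
normal_runs = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
mu, sigma = fit_affine_flow(normal_runs)

# Assumed rule: flag anything scoring worse than the worst training sample.
threshold = max(anomaly_score(x, mu, sigma) for x in normal_runs)


def is_anomaly(x):
    return anomaly_score(x, mu, sigma) > threshold
```

A reading such as `2.0` lies far outside the training distribution and is flagged, whereas `1.02` is not. A deployed monitor would replace the affine layer with a deep conditional flow and calibrate the threshold on held-out data.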
145. 【2603.11085】Edge-Assisted Multi-Robot Visual-Inertial SLAM with Efficient Communication
Link: https://arxiv.org/abs/2603.11085
Authors: Xin Liu, Shuhuan Wen, Jing Zhao, Tony Z. Qiu, Hong Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Keywords: multi-robot Simultaneous Localization, Localization and Mapping, Simultaneous Localization, real-time multi-robot Simultaneous, achieve global consistent
Comments: 13 pages, 18 figures
Abstract:The integration of cloud computing and edge computing is an effective way to achieve globally consistent, real-time multi-robot Simultaneous Localization and Mapping (SLAM). Cloud computing effectively solves the problem of limited computing, communication and storage capacity of terminal equipment. However, limited bandwidth and extremely long communication links between terminal devices and the cloud result in serious performance degradation of multi-robot SLAM systems. To reduce the computational cost of feature tracking and improve the real-time performance of the robot, a lightweight SLAM method of optical flow tracking based on pyramid IMU prediction is proposed. On this basis, a centralized multi-robot SLAM system based on a robot-edge-cloud layered architecture is proposed to realize real-time collaborative SLAM. It avoids the problems of limited on-board computing resources and low execution efficiency of a single robot. In this framework, only the feature points and keyframe descriptors are transmitted, and lossless encoding and compression are applied to realize real-time remote information transmission with limited bandwidth resources. This design reduces the bandwidth actually occupied during data transmission without the loss of SLAM accuracy that data compression can otherwise cause. Experiments on the EuRoC dataset show that, compared with the current most advanced local feature compression method, our method transmits features with a lower data volume, and compared with current advanced centralized multi-robot SLAM schemes, it achieves the same or better positioning accuracy under low computational load.
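The point that lossless encoding cannot degrade SLAM accuracy follows from the exact round trip, which the sketch below demonstrates on integer keypoint coordinates. It uses Python's standard `struct` and `zlib` modules purely as a stand-in; the paper's actual codec is not described in the abstract, and the synthetic keypoints and 16-bit packing are assumptions.

```python
import struct
import zlib


def encode_keypoints(points):
    """Pack integer (x, y) pixel coordinates as little-endian uint16, then deflate."""
    flat = [coord for point in points for coord in point]
    raw = struct.pack(f"<{len(flat)}H", *flat)
    return zlib.compress(raw, 9)


def decode_keypoints(blob):
    """Inverse of encode_keypoints: inflate, then unpack into (x, y) tuples."""
    raw = zlib.decompress(blob)
    values = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]


# Synthetic keypoints inside an assumed 752x480 image.
keypoints = [((10 * i) % 752, (7 * i) % 480) for i in range(200)]

blob = encode_keypoints(keypoints)
restored = decode_keypoints(blob)
```

Because `restored` is identical to `keypoints`, the receiver works with exactly the data the robot extracted. Any bandwidth saving then comes from the encoding alone rather than from discarding information.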
146. 【2603.11071】TinyNav: End-to-End TinyML for Real-Time Autonomous Navigation on Microcontrollers
Link: https://arxiv.org/abs/2603.11071
Authors: Pooria Roy, Nourhan Jadallah, Tomer Lapid, Shahzaib Ahmad, Armita Afroushe, Mete Bayrak
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: navigation typically relies, power-intensive processors, limiting accessibility, low-cost robotics, Autonomous navigation typically
Comments: 6 pages, 7 figures, presented at CUCAI2026 (Canadian Undergraduate Conference on AI, [this https URL](https://cucai.ca))
Abstract:Autonomous navigation typically relies on power-intensive processors, limiting accessibility in low-cost robotics. Although microcontrollers offer a resource-efficient alternative, they impose strict constraints on model complexity. We present TinyNav, an end-to-end TinyML system for real-time autonomous navigation on an ESP32 microcontroller. A custom-trained, quantized 2D convolutional neural network processes a 20-frame sliding window of depth data to predict steering and throttle commands. By avoiding 3D convolutions and recurrent layers, the 23k-parameter model achieves 30 ms inference latency. Correlation analysis and Grad-CAM validation indicate consistent spatial awareness and obstacle avoidance behavior. TinyNav demonstrates that responsive autonomous control can be deployed directly on highly constrained edge devices, reducing reliance on external compute resources.
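As a rough illustration of the 20-frame sliding window mentioned in the abstract, the sketch below keeps the most recent depth frames in a fixed-size buffer. The abstract does not say how the frames are fed to the 2D network; stacking them along a channel axis is an assumption here, and the class name `DepthWindow`, the helper `make_frame`, and the tiny 2x3 frame size are all hypothetical.

```python
from collections import deque

WINDOW = 20  # frames per window, as stated in the abstract


class DepthWindow:
    """Keeps the most recent WINDOW depth frames (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=window)

    def push(self, frame):
        self.frames.append(frame)

    def ready(self):
        # True once a full window has been collected.
        return len(self.frames) == self.frames.maxlen

    def as_channels(self):
        # Assumption: frames are stacked as channels, giving (WINDOW, H, W),
        # so a plain 2D convolution can see the temporal context.
        return list(self.frames)


def make_frame(value, height=2, width=3):
    # Toy depth frame filled with a single value (sizes are made up).
    return [[value] * width for _ in range(height)]


win = DepthWindow()
for i in range(25):
    win.push(make_frame(i))
stacked = win.as_channels()
```

After 25 pushes the buffer holds frames 5 through 24, so the oldest five have been discarded without any explicit bookkeeping.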
147. 【2603.12046】Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
Link: https://arxiv.org/abs/2603.12046
Authors: Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Keywords: Audio-Visual Speech Recognition, Speech Recognition, Audio-Visual Speech, robust recognition, leverages both acoustic
Comments: Project website: [this https URL](https://umbertocappellazzo.github.io/Dr-SHAP-AV)
Abstract:Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
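To make the idea of Shapley-based modality attribution concrete, here is a minimal sketch of exact Shapley values for two "players", audio and video. It is not the paper's pipeline: the scores in `TOY_SCORES` are invented for illustration, and a real analysis would obtain them from model outputs with each modality present or masked.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding player p to the current coalition.
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in phi.items()}


# Hypothetical recognition scores for each subset of modalities.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.2,
    frozenset({"audio", "video"}): 0.9,
}

phi = shapley_values(["audio", "video"], TOY_SCORES.__getitem__)
```

With these toy numbers the attributions sum to the score of the full audio-visual input minus the empty-input score, which is the efficiency property that makes Shapley values convenient for reporting a modality balance.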
148. 【2603.11928】AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys
Link: https://arxiv.org/abs/2603.11928
Authors: Dichang Zhang, Yixuan Shao, Simon Birrer, Dimitris Samaras
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vera C. Rubin, Rubin Observatory, space-based Euclid mission, upcoming decade, shaped by large
Comments: 10 pages, 4 figures. Code available at [this https URL](https://github.com/ZHANG7DC/AS-Bridge)
Abstract:The upcoming decade of observational cosmology will be shaped by large sky surveys, such as the ground-based LSST at the Vera C. Rubin Observatory and the space-based Euclid mission. While they promise an unprecedented view of the Universe across depth, resolution, and wavelength, their differences in observational modality, sky coverage, point-spread function, and scanning cadence make joint analysis beneficial, but also challenging. To facilitate joint analysis, we introduce A(stronomical)S(urvey)-Bridge, a bidirectional generative model that translates between ground- and space-based observations. AS-Bridge learns a diffusion model that employs a stochastic Brownian Bridge process between the LSST and Euclid observations. The two surveys have overlapping sky regions, where we can explicitly model the conditional probabilistic distribution between them. We show that this formulation enables new scientific capabilities beyond single-survey analysis, including faithful probabilistic predictions of missing survey observations and inter-survey detection of rare events. These results establish the feasibility of inter-survey generative modeling. AS-Bridge is therefore well-positioned to serve as a complementary component of future LSST-Euclid joint data pipelines, enhancing the scientific return once data from both surveys become available. Data and code are available at this https URL.
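For readers unfamiliar with the Brownian bridge named in the abstract, the sketch below samples the marginal of a standard scalar bridge pinned at two endpoints. It only illustrates the textbook process, not the AS-Bridge model itself; the function name `brownian_bridge_sample` and the `sigma` parameter are assumptions, and an image-valued version would apply the same formula per pixel.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample X_t for a Brownian bridge with X_0 = x0 and X_1 = x1.

    The marginal is Gaussian with mean (1 - t) * x0 + t * x1 and
    variance sigma**2 * t * (1 - t), so the noise vanishes at both ends.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * math.sqrt(t * (1.0 - t))
    return mean + std * rng.gauss(0.0, 1.0)


# Endpoints stand in for a pixel value in each survey (made-up numbers).
start = brownian_bridge_sample(2.0, 5.0, 0.0)
end = brownian_bridge_sample(2.0, 5.0, 1.0)
middle_no_noise = brownian_bridge_sample(2.0, 5.0, 0.5, sigma=0.0)
```

Because the variance is zero at `t = 0` and `t = 1`, the process can be run in either direction between the two endpoints, which is what makes a bridge a natural fit for bidirectional translation.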
149. 【2603.11850】Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning
Link: https://arxiv.org/abs/2603.11850
Authors: Johan Andreas Balle Rubak, Sara Haghighat, Sanyam Jain, Mostafa Aldesoki, Akhilanand Chaurasia, Sarah Sadat Ehsani, Faezeh Dehghan Ghanatkaman, Ahmad Badruddin Ghazali, Julien Issa, Basel Khalil, Rishi Ramani, Ruben Pauwels
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: mandibular canal increases, alveolar nerve injury, inferior alveolar nerve, mandibular canal, nerve injury
Comments:
Abstract:Impaction of the mandibular third molar in proximity to the mandibular canal increases the risk of inferior alveolar nerve injury. Panoramic radiography is routinely used to assess this relationship. Automated classification of molar-canal overlap could support clinical triage and reduce unnecessary CBCT referrals, while federated learning (FL) enables multi-center collaboration without sharing patient data. We compared Local Learning (LL), FL, and Centralized Learning (CL) for binary overlap/no-overlap classification on cropped panoramic radiographs partitioned across eight independent labelers. A pretrained ResNet-34 was trained under each paradigm and evaluated using per-client metrics with locally optimized thresholds and pooled test performance with a global threshold. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-based metrics, alongside training dynamics, Grad-CAM visualizations, and server-side aggregate monitoring signals. On the test set, CL achieved the highest performance (AUC 0.831; accuracy = 0.782), FL showed intermediate performance (AUC 0.757; accuracy = 0.703), and LL generalized poorly across clients (AUC range = 0.619-0.734; mean = 0.672). Training curves suggested overfitting, particularly in LL models, and Grad-CAM indicated more anatomically focused attention in CL and FL. Overall, centralized training provided the strongest performance, while FL offers a privacy-preserving alternative that outperforms LL.
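The abstract reports both AUC and threshold-based metrics. As a small illustration of how the two differ, the sketch below computes AUC from pairwise rankings and accuracy at a chosen cutoff. The labels and scores are invented toy data, not results from the paper, and the helper names `roc_auc` and `accuracy_at` are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold):
    """Accuracy when scores at or above the threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy overlap (1) / no-overlap (0) labels with made-up classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
acc_global = accuracy_at(labels, scores, 0.5)
```

AUC depends only on the ordering of the scores, while accuracy changes with the cutoff, which is why the abstract distinguishes locally optimized thresholds from a single global one.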
150. 【2603.11806】A Diffeomorphism Groupoid and Algebroid Framework for Discontinuous Image Registration
Link: https://arxiv.org/abs/2603.11806
Authors: Lili Bao, Bin Xiao, Shihui Ying, Stefan Sommer
Categories: Group Theory (math.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: piecewise diffeomorphic image, Diffeomorphic Metric Mapping, diffeomorphic image registration, Deformation Diffeomorphic Metric, Large Deformation Diffeomorphic
Comments:
Abstract:In this paper, we propose a novel mathematical framework for piecewise diffeomorphic image registration that involves discontinuous sliding motion using a diffeomorphism groupoid and algebroid approach. The traditional Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration method builds on Lie groups, which assume continuity and smoothness in velocity fields, limiting its applicability in handling discontinuous sliding motion. To overcome this limitation, we extend the diffeomorphism Lie groups to a framework of discontinuous diffeomorphism Lie groupoids, allowing for discontinuities along sliding boundaries while maintaining diffeomorphism within homogeneous regions. We provide a rigorous analysis of the associated mathematical structures, including Lie algebroids and their duals, and derive specific Euler-Arnold equations to govern optimal flows for discontinuous deformations. Some numerical tests are performed to validate the efficiency of the proposed approach.
151. 【2603.11316】MRI2Qmap: multi-parametric quantitative mapping with MRI-driven denoising priors
Link: https://arxiv.org/abs/2603.11316
Authors: Mohammad Golbabaee, Matteo Cencini, Carolin Pirkl, Marion Menzel, Michela Tosetti, Bjoern Menze
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Magnetic Resonance Fingerprinting, Magnetic Resonance, Resonance Fingerprinting, multiple tissue properties, transient-state parameter mapping
Comments:
Abstract:Magnetic Resonance Fingerprinting (MRF) and other highly accelerated transient-state parameter mapping techniques enable simultaneous quantification of multiple tissue properties, but often suffer from aliasing artifacts due to compressed sampling. Incorporating spatial image priors can mitigate these artifacts, and deep learning has shown strong potential when large training datasets are available. However, extending this paradigm to MRF-type sequences remains challenging due to the scarcity of quantitative imaging data for training. Can this limitation be overcome by leveraging sources of training data from clinically-routine weighted MRI images? To this end, we introduce MRI2Qmap, a plug-and-play quantitative reconstruction framework that integrates the physical acquisition model with priors learned from deep denoising autoencoders pretrained on large multimodal weighted-MRI datasets. MRI2Qmap demonstrates that spatial-domain structural priors learned from independently acquired datasets of routine weighted-MRI images can be effectively used for quantitative MRI reconstruction. The proposed method is validated on highly accelerated 3D whole-brain MRF data from both in-vivo and simulated acquisitions, achieving competitive or superior performance relative to existing baselines without requiring ground-truth quantitative imaging data for training. By decoupling quantitative reconstruction from the need for ground-truth MRF training data, this framework points toward a scalable paradigm for quantitative MRI that can capitalize on the large and growing repositories of routine clinical MRI.

