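This post lists the latest papers fetched daily from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.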

Statistics

1,176 papers were updated today, including:

  • Natural Language Processing: 152
  • Information Retrieval: 37
  • Computer Vision: 270

Natural Language Processing

1. 【2603.22286】WorldCache: Content-Aware Caching for Accelerated Video World Models

Link: https://arxiv.org/abs/2603.22286

Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: power high-fidelity video, costly spatio-temporal attention, high-fidelity video world, video world models, remain computationally expensive

Comments: 33 pages

Abstract:Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at World-Cache (this https URL).

2. 【2603.22281】ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Link: https://arxiv.org/abs/2603.22281

Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: shown promising capability, forecasting future world, Recent progress, future world states, shown promising

Comments: 10 pages, 5 figures

Abstract:Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

3. 【2603.22267】TiCo: Time-Controllable Training for Spoken Dialogue Models

Link: https://arxiv.org/abs/2603.22267

Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: simple post-training method, follow time-constrained instructions, post-training method, enabling spoken dialogue, spoken dialogue models

Abstract:We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., 10.6 seconds). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

4. 【2603.22260】Greater accessibility can amplify discrimination in generative AI

Link: https://arxiv.org/abs/2603.22260

Authors: Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher

Categories: Computation and Language (cs.CL)

Keywords: Hundreds of millions, large language models, millions of people, people rely, rely on large

Comments: Preprint

Abstract:Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.

5. 【2603.22241】MemDLM: Memory-Enhanced DLM Training

Link: https://arxiv.org/abs/2603.22241

Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

Categories: Computation and Language (cs.CL)

Keywords: Diffusion Language Models, Diffusion Language, offer attractive advantages, full-attention parallel decoding, Language Models

Abstract:Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: this https URL.

6. 【2603.22227】Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research

Link: https://arxiv.org/abs/2603.22227

Authors: David M. Markowitz

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: social life, ubiquitous in social, empirical study, interactive process, insufficiently modular

Abstract:Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (this https URL), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.

7. 【2603.22225】Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson's Disease

Link: https://arxiv.org/abs/2603.22225

Authors: Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: dysarthric speech data, speech data makes, data makes cross-lingual, challenging problem, limited availability

Comments: Submitted to Interspeech 2026

Abstract:The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
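
One plausible reading of "centroid-based vector adaptation" is a translation of source-language embeddings by the difference between the healthy-control centroids of the two languages. The sketch below illustrates only that reading; the function names and the use of a plain mean are assumptions, not the authors' implementation.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_shift(source_feats, source_healthy, target_healthy):
    """Translate source-language embeddings toward the target language.

    Assumption: the shift is the target healthy-control centroid minus the
    source healthy-control centroid, applied to every source embedding.
    """
    src_c = centroid(source_healthy)
    tgt_c = centroid(target_healthy)
    shift = [t - s for s, t in zip(src_c, tgt_c)]
    return [[x + d for x, d in zip(vec, shift)] for vec in source_feats]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" (hypothetical values).
    source_healthy = [[1.0, 2.0], [3.0, 4.0]]
    target_healthy = [[0.0, 0.0], [2.0, 2.0]]
    shifted = language_shift(source_healthy, source_healthy, target_healthy)
    print(centroid(shifted))
```

After the shift, the healthy-control centroid of the source language coincides with that of the target language, so a classifier trained on shifted source data sees less language-specific offset.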

8. 【2603.22216】Gumbel Distillation for Parallel Text Generation

Link: https://arxiv.org/abs/2603.22216

Authors: Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: sequential nature, nature of autoregressive, Gumbel Distillation, driven the adoption, Gumbel

Comments: ICLR 2026

Abstract:The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at this https URL.
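
For readers unfamiliar with the Gumbel-Max trick mentioned in the abstract, here is a minimal sketch of the trick itself: adding independent Gumbel(0, 1) noise to the logits and taking the argmax yields a sample from the softmax distribution, and the chosen token is a deterministic function of the noise. This does not reproduce the paper's distillation procedure; all function names are illustrative.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max(logits, noise):
    """Index maximizing logit + noise; deterministic once noise is fixed."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])


def sample_token(logits, rng):
    """Sample a token index and return it together with the noise used."""
    noise = [sample_gumbel(rng) for _ in logits]
    return gumbel_max(logits, noise), noise


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.7, 0.2, 0.1]
    logits = [math.log(p) for p in probs]
    draws = [sample_token(logits, rng)[0] for _ in range(20000)]
    freqs = [draws.count(i) / len(draws) for i in range(len(probs))]
    print(freqs)  # empirical frequencies should be close to probs
```

Because the noise fully determines the sampled token, a (noise, token) pair recorded from an autoregressive teacher can in principle serve as a supervised target for a parallel decoder, which is the mapping the abstract describes.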

9. 【2603.22213】SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Link: https://arxiv.org/abs/2603.22213

Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, motivating extensive efforts, coverage remains incomplete, knowledge coverage remains, knowledge injection

Abstract:While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at this https URL.

10. 【2603.22186】Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

Link: https://arxiv.org/abs/2603.22186

Authors: Ireh Kim, Tesia Sker, Chanwoo Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, generally underperformed compared, conventional encoder-decoder systems

Comments: Accepted to ICASSP 2026

Abstract:In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.

11. 【2603.22136】The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems

Link: https://arxiv.org/abs/2603.22136

Authors: Lars Vogt

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: Semantic, formal semantic models, created and communicated, reconcile two fundamentally, fundamentally different forms

Abstract:Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.

12. 【2603.22103】Multiperspectivity as a Resource for Narrative Similarity Prediction

Link: https://arxiv.org/abs/2603.22103

Authors: Max Upravitelev, Veronika Solopova, Jing Yang, Charlott Jakob, Premtim Sahitaj, Ariana Sahitaj, Vera Schmitt

Categories: Computation and Language (cs.CL)

Keywords: Predicting narrative similarity, Predicting narrative, equally valid readings, similarity judgments, posing a fundamental

Abstract:Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
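
The Condorcet-style effect the abstract refers to can be illustrated with a small sketch: majority voting over persona predictions, plus the textbook probability that a majority of n voters is correct. The formula assumes fully independent voters with equal accuracy p, which is a stronger assumption than the "weakened independence" the authors discuss; the names and toy values are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Return the most frequent label among the persona predictions."""
    return Counter(labels).most_common(1)[0][0]


def majority_accuracy(p, n):
    """P(majority of n voters is correct), for odd n.

    Assumes independent voters, each correct with probability p.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    print(majority_vote(["A", "B", "A"]))
    for n in (1, 3, 31):
        print(n, round(majority_accuracy(0.6, n), 3))
```

With individual accuracy above 0.5, the majority's accuracy grows with ensemble size under these idealized assumptions; correlated errors reduce that gain, which is consistent with the abstract's observation that less correlated personas yield larger ensemble improvements.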

13. 【2603.22075】Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

Link: https://arxiv.org/abs/2603.22075

Authors: Caio Vicentino

Categories: Computation and Language (cs.CL)

Keywords: controlled empirical comparison, comparison between autoregressive, masked diffusion, present a controlled, controlled empirical

Comments: 10 pages, 2 figures, 4 tables. Code and checkpoints at [this https URL](https://github.com/caiovicentino/arche)

Abstract:We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
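
The diversity figures in the abstract rest on metrics such as Distinct-n and the share of unique openings. Below is a minimal sketch of how such metrics are commonly computed, assuming whitespace tokenization and the usual definition of Distinct-n (unique n-grams divided by total n-grams); the paper's exact tokenization and aggregation may differ.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def unique_opening_rate(texts, k=5):
    """Fraction of samples whose first k tokens form a distinct opening."""
    openings = [tuple(text.split()[:k]) for text in texts]
    return len(set(openings)) / len(openings) if openings else 0.0


if __name__ == "__main__":
    samples = [
        "once upon a time there was a cat",
        "once upon a time there was a dog",
        "a little bird flew over the hill",
    ]
    print(distinct_n(samples, 2))
    print(unique_opening_rate(samples, k=5))
```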

14. 【2603.22056】Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Link: https://arxiv.org/abs/2603.22056

Authors: Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, language tasks, resource demands, Knowledge Distillation

Comments: Accepted at ICASSP 2026

Abstract:Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

15. 【2603.22016】ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Link: https://arxiv.org/abs/2603.22016

Authors: Xinyan Wang, Xiaogeng Liu, Chaowei Xiao

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Reasoning Models, generating redundant reasoning, challenging tasks, redundant reasoning steps, generating long

Comments: Code is available at [this https URL](https://github.com/SaFo-Lab/ROM)

Abstract:Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.

16. 【2603.22015】Retrieving Climate Change Disinformation by Narrative

Link: https://arxiv.org/abs/2603.22015

Authors: Max Upravitelev, Veronika Solopova, Charlott Jakob, Premtim Sahitaj, Vera Schmitt

Categories: Computation and Language (cs.CL)

Keywords: Detecting climate disinformation, Detecting climate, narratives typically relies, accommodate emerging narratives, typically relies

Abstract:Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative's core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.

17. 【2603.22008】On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Link: https://arxiv.org/abs/2603.22008

Authors: Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: software engineering systems, modern LLM-based software, LLM-based software engineering, engineering systems, large codebases

Comments: 15 pages, 5 figures, 12 tables

Abstract:Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

18. 【2603.21975】SecureBreak -- A dataset towards safe and secure models

Link: https://arxiv.org/abs/2603.21975

Authors: Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, pervasive core components, Large language, real-world applications, pervasive core

Abstract:Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an "ultimate" defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.

19. 【2603.21972】Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

Link: https://arxiv.org/abs/2603.21972

Authors: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: evolving Large Language, Large Language Models, Reinforcement Learning, multi-turn environments remains, environments remains elusive

Comments: Codes are available at [this https URL](https://github.com/WxxShirley/Agent-STAR)

Abstract:Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.

20. 【2603.21970】Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning

Link: https://arxiv.org/abs/2603.21970

Authors: Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: substantial computational resources, demands substantial computational, Fine-tuning large language, large language models, text summarization demands

Notes: 9 pages, 5 figures, presented at 6th International Conference on NLP Text Mining (NLTM 2026), March 21-22, Sydney, Australia. Published in Computer Science Information Technology (CS IT), pp. 01-09, 2026

Abstract: Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, namely Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at this https URL.
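
The small trainable fraction that LoRA achieves follows from simple counting: a frozen d x k weight matrix updated through two rank-r factors adds r(d + k) trainable parameters instead of d*k. The sketch below works this out for a single hypothetical 1024 x 1024 projection with rank 8. These dimensions are illustrative assumptions and are not taken from the paper, whose 0.6% figure refers to the whole Flan-T5-Large model.

```python
def lora_trainable_params(d: int, k: int, rank: int) -> int:
    """Trainable parameters added by a rank-`rank` LoRA update to a d x k weight.

    LoRA keeps W (d x k) frozen and learns B (d x rank) and A (rank x k),
    so the update B @ A costs rank * (d + k) parameters.
    """
    return rank * (d + k)


def trainable_fraction(d: int, k: int, rank: int) -> float:
    """LoRA parameters as a fraction of the frozen d x k matrix."""
    return lora_trainable_params(d, k, rank) / (d * k)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d, k, rank = 1024, 1024, 8
    print(lora_trainable_params(d, k, rank))      # 16384
    print(f"{trainable_fraction(d, k, rank):.4%}")  # 1.5625%
```

Doubling the rank doubles the adapter size, which is one reason a rank sensitivity analysis like the one the abstract mentions is cheap to run.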

21. 【2603.21966】BHDD: A Burmese Handwritten Digit Dataset

Link: https://arxiv.org/abs/2603.21966

Authors: Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing, Thuya Myo Nyunt

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Burmese Handwritten Digit, handwritten Burmese digits, Burmese Handwritten, handwritten Burmese, Handwritten Digit Dataset

Notes: 4 pages, 9 figures, 1 table. Dataset available at [this https URL](https://github.com/baseresearch/BHDD)

Abstract:We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at this https URL

22. 【2603.21940】SLURP-TN: Resource for Tunisian Dialect Spoken Language Understanding

Link: https://arxiv.org/abs/2603.21940

Authors: Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares

Categories: Computation and Language (cs.CL)

Keywords: Spoken Language Understanding, Spoken Language, Language Understanding, aims to extract, user queries

Notes: Accepted at LREC 2026

Abstract: Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4165 sentences recorded into around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: this https URL.

23. 【2603.21900】Ara-Best-RQ: Multi Dialectal Arabic SSL

Link: https://arxiv.org/abs/2603.21900

Authors: Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares

Categories: Computation and Language (cs.CL)

Keywords: Arabic speech processing, models specifically designed, Creative Commons speech, crawled Creative Commons, multi-dialectal Arabic speech

Notes: Accepted at ICASSP 2026

Abstract:We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.

24. 【2603.21847】Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures

Link: https://arxiv.org/abs/2603.21847

Authors: Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish

Categories: Computation and Language (cs.CL)

Keywords: entering everyday devices, Consumer-grade EEG, everyday devices, earbuds to headbands, raising the question

Abstract: Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.

25. 【2603.21840】Select, Label, Evaluate: Active Testing in NLP

Link: https://arxiv.org/abs/2603.21840

Authors: Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Natural Language Processing, Language Processing, Natural Language, time remain significant, remain significant bottlenecks

Notes: 27 pages, 6 figures

Abstract:Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and we conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimation accuracy difference from the full test set within 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.
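
The budgeted-evaluation setting described above can be illustrated with the simplest possible baseline: label a uniformly random subset of the test set and report accuracy on it. This is only a sketch of the problem setup, not one of the selection strategies the paper benchmarks. The function name and the `oracle` callable (a stand-in for a human annotator) are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence


def estimate_accuracy(
    predictions: Sequence[int],
    oracle: Callable[[int], int],
    budget: int,
    seed: int = 0,
) -> float:
    """Estimate model accuracy by labeling only `budget` test items.

    `oracle(i)` returns the true label of item i and is called only for the
    sampled items, which models the annotation cost being limited.
    """
    if not 0 < budget <= len(predictions):
        raise ValueError("budget must be between 1 and the test-set size")
    rng = random.Random(seed)
    # Uniform sampling without replacement: the baseline selection rule.
    chosen = rng.sample(range(len(predictions)), budget)
    correct = sum(predictions[i] == oracle(i) for i in chosen)
    return correct / budget


if __name__ == "__main__":
    preds = [1, 0, 1, 1, 0, 1, 0, 0]
    gold = [1, 0, 0, 1, 0, 1, 1, 0]  # toy labels, hypothetical
    print(estimate_accuracy(preds, gold.__getitem__, budget=4))
    print(estimate_accuracy(preds, gold.__getitem__, budget=8))  # full test set
```

With the budget equal to the test-set size the estimate equals the true accuracy; smaller budgets trade annotation cost for estimation variance, which is the gap that more informed selection strategies aim to close.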

26. 【2603.21836】Instruction Set and Language for Symbolic Regression

Link: https://arxiv.org/abs/2603.21836

Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Keywords: largely unaddressed obstacle, distinct node-numbering schemes, consuming fitness evaluations, Symbolic regression, encodes expression DAGs

Abstract:A fundamental but largely unaddressed obstacle in Symbolic regression (SR) is structural redundancy: every expression DAG with admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string -- a complete labeled-DAG isomorphism invariant -- that collapses all the equivalent representations into a single canonical form.

27. 【2603.21823】Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power

Link: https://arxiv.org/abs/2603.21823

Authors: Bros Victor, Barbini Matilde, Gerard Patrick, Gatica-Perez Daniel

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: rarely distinguish interrogatives, English-language corpora, large-scale computational studies, conversation analysis, differentiate their functions

Notes: ICWSM 2026

Abstract:Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the "Politics of Questions" in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist's narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.

28. 【2603.21745】The Presupposition Problem in Representation Genesis

Link: https://arxiv.org/abs/2603.21745

Authors: Yiling Wu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, states guide behavior, non-representing physical system, high cognitive performance, achieve high cognitive

Abstract:Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.

29. 【2603.21736】The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures

Link: https://arxiv.org/abs/2603.21736

Authors: Yiling Wu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: philosophy of mind, representational systems, demands exists, structural, demands

Abstract:Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.

30. 【2603.21728】EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

Link: https://arxiv.org/abs/2603.21728

Authors: Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou, Xiaohui Yan, Yougang Lyu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: transform initial concepts, high-quality research proposals, research proposals remains, Large Language Models, challenge for Large

Abstract: Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.
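
One way to read "lexicographic" in this context is as an ordering over score tuples: candidates are compared on the first dimension, and later dimensions only break ties. The sketch below shows that ordering using Python's built-in tuple comparison. The abstract does not say how EvoIdeator defines or orders its reward dimensions, so the `ChecklistScore` fields, their priority order, and the integer scores are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class ChecklistScore(NamedTuple):
    """Hypothetical per-dimension checklist scores, highest priority first."""
    grounding: int
    feasibility: int
    rigor: int


def best_idea(scored: List[Tuple[str, ChecklistScore]]) -> str:
    """Return the idea whose score tuple is lexicographically largest.

    NamedTuple instances compare like plain tuples, so `max` compares
    grounding first and consults later fields only when earlier ones tie.
    """
    return max(scored, key=lambda item: item[1])[0]


if __name__ == "__main__":
    candidates = [
        ("idea-a", ChecklistScore(grounding=2, feasibility=5, rigor=5)),
        ("idea-b", ChecklistScore(grounding=3, feasibility=0, rigor=0)),
        ("idea-c", ChecklistScore(grounding=3, feasibility=0, rigor=1)),
    ]
    print(best_idea(candidates))  # idea-c
```

Unlike a weighted sum, this ordering never lets a high score on a low-priority dimension compensate for a deficit on a higher-priority one, which is why idea-a loses despite its larger total.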

31. 【2603.21720】SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models

Link: https://arxiv.org/abs/2603.21720

Authors: Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: natural language processing, direct-cause inference remains, inference remains underexplored, practical decision-making, evidence-rich settings

Notes: 9 pages, 3 figures, SemEval 2026 Task 12 description paper

Abstract: Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). (The task data is available at this https URL.) The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.

32. 【2603.21719】Probing How Scalable Table Data Enhances General Long-Context Reasoning

Link: https://arxiv.org/abs/2603.21719

Authors: Huaibing Xie, Guoliang Zhao, Yang Liu, Shihan Dou, Siming Huang, Yanling Xiao, Shaolei Wang, Yiting Liu, Cheng Zhang, Shaofan Liu, Pluto Zhou

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, grow increasingly complex, real-world tasks grow

Abstract: As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.

33. 【2603.21676】Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Link: https://arxiv.org/abs/2603.21676

Authors: Hung-Hsuan Chen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Standard Transformers, fixed computational depth, requiring variable-depth reasoning, multi-hop graph traversal, tasks requiring variable-depth

Abstract: Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear "computational frontier" -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
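
The idea of trading recurrence steps for reasoning depth can be illustrated with a toy analogue on graph reachability. The sketch below is not the paper's architecture and involves no learned weights: it repeatedly applies one fixed update rule to a set of visited nodes, keeping the previous state at every step. The function name and the example graph are assumptions chosen for illustration.

```python
from typing import Dict, List, Set


def reachable_after(adjacency: Dict[int, List[int]], source: int, steps: int) -> Set[int]:
    """Apply the same one-hop propagation rule `steps` times.

    The rule is shared across steps, so the amount of computation grows
    with `steps` while the rule itself stays fixed.
    """
    state = {source}
    for _ in range(steps):
        # Keep the previous state and add everything one hop further out.
        state = state | {nxt for node in state for nxt in adjacency.get(node, ())}
    return state


if __name__ == "__main__":
    # A directed chain 0 -> 1 -> 2 -> 3 -> 4: node 4 is four hops from node 0.
    chain = {0: [1], 1: [2], 2: [3], 3: [4]}
    print(4 in reachable_after(chain, 0, steps=3))  # False
    print(4 in reachable_after(chain, 0, steps=4))  # True
```

The target becomes reachable only once the step count matches the hop distance, a crude picture of performance switching on when thinking steps scale with task complexity.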

34. 【2603.21673】Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Link: https://arxiv.org/abs/2603.21673

Authors: Shixu Liu

Categories: Computation and Language (cs.CL)

Keywords: Generating interpretable natural, natural language processing, interpretable natural language, Large Language Models, Generating interpretable

Notes: Preprint and under consideration

Abstract:Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.

35. 【2603.21663】TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

Link: https://arxiv.org/abs/2603.21663

Authors: Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu

Categories: Computation and Language (cs.CL)

Keywords: remarkable performance gains, large language models, range of tasks, rapid progress, progress of large

Abstract:The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at this https URL.

36. 【2603.21658】A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

Link: https://arxiv.org/abs/2603.21658

Authors: Bowen Chen, Namgi Han, Yusuke Miyao

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: component of intelligence, Memorization, model series, LLMs, memorized sequences

Notes: 8 pages of main content, in conference submission, other contents are references and extra appendix

Abstract:Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrated a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and attention head ablation, we revealed the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLM.

37. 【2603.21636】Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks

Link: https://arxiv.org/abs/2603.21636

Authors: Yiliang Song, Hongjun An, Jiangan Chen, Xuanchen Yan, Huan Song, Jiawei Shao, Xuelong Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Public benchmarks increasingly, large language models, benchmarks increasingly govern, Public benchmarks, increasingly govern

Notes: First update

Abstract:Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.

38. 【2603.21571】DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

Link: https://arxiv.org/abs/2603.21571

Authors: Nasser-Eddine Monir,Zakaria Baou

Categories: Computation and Language (cs.CL)

Keywords: parallel English-Tashlhiyt corpus, parallel English-Tashlhiyt, English-Tashlhiyt corpus, corpus that fills, fills a critical

Comments: This paper has been accepted for presentation at LREC 2026

Abstract:DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.

39. 【2603.21529】SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

Link: https://arxiv.org/abs/2603.21529

Authors: Migyeong Kang,Jihyun Kim,Hyolim Jeon,Sunwoo Hwang,Jihyun An,Yonghoon Kim,Haewoon Kwak,Jisun An,Jinyoung Han

Categories: Computation and Language (cs.CL)

Keywords: users' mental states, infer fine-grained mental, fine-grained mental health, social media aims, mental health symptoms

Abstract:Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.

40. 【2603.21524】CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs

Link: https://arxiv.org/abs/2603.21524

Authors: Ravi Ranjan,Utkarsh Grover,Mayur Akewar,Xiaomin Lin,Agoritsa Polyzou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, show demographic, fairness and trust, Prior debiasing methods

Comments: 9 pages, 4 figures, and accepted in IJCNN 2026 (part of IEEE WCCI 2026)

Abstract:Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.

41. 【2603.21520】Generalizable Self-Evolving Memory for Automatic Prompt Optimization

Link: https://arxiv.org/abs/2603.21520

Authors: Guanbao Liang,Yuanchen Bei,Sheng Zhou,Yuheng Qin,Huan Zhou,Bingxin Jia,Bin Li,Jiajun Bu

Categories: Computation and Language (cs.CL)

Keywords: adapting large language, existing methods typically, methods typically search, large language models, Automatic prompt optimization

Abstract:Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.

42. 【2603.21519】Triangulating Temporal Dynamics in Multilingual Swiss Online News

Link: https://arxiv.org/abs/2603.21519

Authors: Bros Victor,Dufraisse Evan,Popescu Adrian,Gatica-Perez Daniel

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: ecosystems remain limited, Analyzing news coverage, national media ecosystems, media ecosystems remain, offer valuable insights

Comments: ICWSM 2026

Abstract:Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country's three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.

43. 【2603.21494】Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Link: https://arxiv.org/abs/2603.21494

Authors: Mohamed Sobhi Jabal(1),Jikai Zhang(2 and 3),Dominic LaBella(4),Jessica L. Houk(1),Dylan Zhang(1 and 7),Jeffrey D. Rudie(5 and 8),Kirti Magudia(1),Maciej A. Mazurowski(1, 2 and 6),Evan Calabrese(1 and 3) ((1) Duke University Medical Center, Durham NC, (2) Duke University, Durham NC, (3) Duke Center for Artificial Intelligence in Radiology, Durham NC, (4) Duke University Medical Center, Durham NC, (5) University of California San Diego, San Diego CA, (6) Duke University School of Medicine, Durham NC, (7) Santa Clara Valley Medical Center, San Jose CA, (8) Scripps Clinic Medical Group, San Diego CA)

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Brain Tumor Reporting, Reporting and Data, requires complex integration, standardizes post-treatment MRI, Brain Tumor

Comments: 17 pages, 5 figures, 4 tables, 2 supplementary figures, 3 supplementary tables

Abstract:The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P < .001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
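
As a reading aid, the accuracy figures in this abstract can be rechecked from the raw counts. The sketch below recomputes both proportions, their gap in percentage points, and a 95% interval using the Wilson score method. The abstract does not say which interval method the authors used, so the Wilson choice and the helper name `wilson_interval` are assumptions made here for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Counts taken from the abstract: correct classifications out of 492 exams.
system_correct, clinical_correct, total = 374, 283, 492

system_ci = wilson_interval(system_correct, total)
clinical_ci = wilson_interval(clinical_correct, total)
gap_pp = (system_correct - clinical_correct) / total * 100

print(f"system:   {system_correct / total:.1%}, CI {system_ci[0]:.1%}-{system_ci[1]:.1%}")
print(f"clinical: {clinical_correct / total:.1%}, CI {clinical_ci[0]:.1%}-{clinical_ci[1]:.1%}")
print(f"gap: {gap_pp:.1f} percentage points")
```

Under this assumption the intervals land close to the reported ones, which suggests the published figures are at least consistent with a standard binomial interval on 492 examinations.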

44. 【2603.21489】Effective Strategies for Asynchronous Software Engineering Agents

Link: https://arxiv.org/abs/2603.21489

Authors: Jiayi Geng,Graham Neubig

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: issues on Github, increasingly capable, resolving issues, isolated software engineering, Github

Abstract:AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges with respect to both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
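
The dependency-aware planning step described above can be illustrated with a small scheduling sketch. This is not CAID's implementation: the task names, the `plan_waves` helper, and the idea of grouping tasks into "waves" are hypothetical, chosen only to show how a central manager could decide which subtasks are safe to run concurrently. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group tasks into waves; tasks in one wave have no unmet dependencies.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unlocking successors
    return waves


# Hypothetical subtasks for a small library build-out.
tasks = {
    "tokenizer": set(),
    "parser": set(),
    "evaluator": {"tokenizer", "parser"},
    "cli": {"evaluator"},
}

waves = plan_waves(tasks)
for index, wave in enumerate(waves):
    print(f"wave {index}: {wave}")
```

In a setup like the one the abstract describes, each task in a wave would presumably be handed to a separate agent working in its own isolated workspace (for example a `git worktree` checkout), with results merged and tested before the next wave starts.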

45. 【2603.21478】TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Link: https://arxiv.org/abs/2603.21478

Authors: Kai-Wei Chang,Yi-Cheng Lin,Huang-Cheng Chou,Wenze Ren,Yu-Han Huang,Yun-Shao Tsai,Chien-Cheng Chen,Yu Tsao,Yuan-Fu Liao,Shrikanth Narayanan,James Glass,Hung-yi Lee

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Keywords: diverse populations worldwide, serve diverse populations, populations worldwide, technologies have advanced, advanced rapidly

Comments: submitted to Interspeech 2026

Abstract:Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on this https URL.

46. 【2603.21473】Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns

Link: https://arxiv.org/abs/2603.21473

Authors: Wihan van der Heever,Keane Ong,Ranjan Satapathy,Erik Cambria

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: aspect-based sentiment analysis, distinguish genuine associations, financial markets, addressing the limitations, Newey West HAC

Comments: 13 pages, 6 figures, submitted to Expert Systems with Applications

Abstract:This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect and horizon specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.

47. 【2603.21465】DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Link: https://arxiv.org/abs/2603.21465

Authors: Siqi Guo,Ming Lin,Tianbao Yang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Developing efficient CUDA, Large Language Models, Developing efficient, efficient CUDA kernels, CUDA kernels

Abstract:Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled rewards that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves a speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.

48. 【2603.21461】DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

Link: https://arxiv.org/abs/2603.21461

Authors: James Wedgwood,Aashiq Muhamed,Mona T. Diab,Virginia Smith

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: limited mechanistic visibility, adds substantial alignment-stage, Dynamic SAE Steering, Preference alignment, mechanistic visibility

Abstract:Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to 4.47× fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-k ablation is principled.

49. 【2603.21454】Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Link: https://arxiv.org/abs/2603.21454

Authors: Tae-Eun Song

Categories: Computation and Language (cs.CL)

Keywords: LLM coding benchmarks, test quality issues, quality issues undermine, widespread solution leakage, LLM coding

Comments: 11 pages, 3 figures, 4 tables

Abstract:LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary--models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result--100% sycophantic confirmation--providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
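
To make the reported statistics concrete, the sketch below shows how a Mann-Whitney U of 0 and a rank-biserial correlation of 1.0 arise when two groups are perfectly separated. The diversity scores and the split into four and five problems are invented for illustration; the abstract gives neither per-problem scores nor group sizes, and the p-value is not recomputed here because it depends on those sizes and on the test variant.

```python
def mann_whitney_u(xs, ys):
    """Return the smaller of the two U statistics; ties count as half a win."""
    wins_x = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )
    wins_y = len(xs) * len(ys) - wins_x
    return min(wins_x, wins_y)


def rank_biserial(u, n1, n2):
    """Effect size in [0, 1] when computed from the smaller U statistic."""
    return 1 - 2 * u / (n1 * n2)


# Hypothetical solution-diversity scores across independent sessions:
# recalled (contaminated) problems yield near-identical solutions.
contaminated = [0.00, 0.02, 0.05, 0.10]
genuine = [0.55, 0.60, 0.65, 0.70, 0.80]

u_stat = mann_whitney_u(contaminated, genuine)
effect = rank_biserial(u_stat, len(contaminated), len(genuine))
print(f"U = {u_stat}, rank-biserial r = {effect}")
```

A U of 0 means every score in one group lies below every score in the other, which is what "perfect separation" refers to; with only nine problems, that separation is still a small-sample result.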

50. 【2603.21440】KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Link: https://arxiv.org/abs/2603.21440

Authors: Shuai Wang,Yinan Yu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, impressive natural language, natural language capabilities, Large Language, demonstrate impressive natural

Comments: Accepted to IJCNN 2026

Abstract:Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: this https URL.

51. 【2603.21438】PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts

Link: https://arxiv.org/abs/2603.21438

Authors: Neeladri Bhuiya,Shib Sankar Dasgupta,Andrew McCallum,Haw-Shiuan Chang

Categories: Computation and Language (cs.CL)

Keywords: extract insightful patterns, insightful patterns, extract insightful, box embeddings, prompts

Abstract:To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., "writing an adventure story" is more specific than "writing a story"). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
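
The specificity relation mentioned above is easiest to see with axis-aligned boxes, where a more specific prompt sits inside the box of a more general one. The sketch below is a toy illustration of that general idea and not PROMPT2BOX's encoder or scoring function: the `Box` class, the two-dimensional coordinates, and the `containment` score (intersection volume divided by the inner box's volume) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by per-dimension lower and upper corners."""
    lower: Tuple[float, ...]
    upper: Tuple[float, ...]


def volume(box):
    """Product of side lengths; an empty box has volume 0."""
    result = 1.0
    for lo, hi in zip(box.lower, box.upper):
        result *= max(0.0, hi - lo)
    return result


def intersection(a, b):
    """Overlap of two boxes (may be empty, in which case volume is 0)."""
    lower = tuple(max(x, y) for x, y in zip(a.lower, b.lower))
    upper = tuple(min(x, y) for x, y in zip(a.upper, b.upper))
    return Box(lower, upper)


def containment(inner, outer):
    """Fraction of `inner` lying inside `outer`; 1.0 means fully contained."""
    inner_volume = volume(inner)
    if inner_volume == 0.0:
        return 0.0
    return volume(intersection(inner, outer)) / inner_volume


# Hand-picked toy boxes, not learned embeddings.
story = Box((0.0, 0.0), (4.0, 4.0))            # "writing a story"
adventure_story = Box((1.0, 1.0), (2.0, 3.0))  # "writing an adventure story"

specific_in_general = containment(adventure_story, story)
general_in_specific = containment(story, adventure_story)
print(specific_in_general, general_in_specific)
```

The score is asymmetric, which is the property a single-vector similarity such as cosine lacks: the specific prompt is fully inside the general one, but not the other way around.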

52. 【2603.21437】Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

Link: https://arxiv.org/abs/2603.21437

Authors: Hang Gao,Dimitris N. Metaxas

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling efficient similarity, efficient similarity search, inducing well-known geometric, well-known geometric pathologies, Transformer-based embedding models

Abstract:Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
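
The smoothing effect described above can be seen in a two-dimensional toy case. The sketch below mean-pools a set of sentence vectors and reports the lowest cosine similarity between the pooled vector and any single sentence. It assumes plain mean pooling and hand-picked vectors; it is not the paper's semantic shift measure, and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Element-wise average of the sentence vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def worst_case_similarity(vectors):
    """Lowest cosine similarity between the pooled vector and any sentence."""
    pooled = mean_pool(vectors)
    return min(cosine(pooled, vector) for vector in vectors)


# Toy "sentence embeddings": one text stays on topic, the other drifts.
on_topic = [[1.0, 0.0], [0.9, 0.1]]
drifting = [[1.0, 0.0], [0.0, 1.0]]

on_topic_score = worst_case_similarity(on_topic)
drifting_score = worst_case_similarity(drifting)
print(f"on-topic: {on_topic_score:.3f}, drifting: {drifting_score:.3f}")
```

Both toy texts contain two sentences, so the gap between the scores comes from semantic diversity alone, which matches the abstract's point that length by itself does not explain the degradation.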

53. 【2603.21418】Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

Link: https://arxiv.org/abs/2603.21418

Authors: Mariela M. Nina,Caio Veloso Costa,Lilian Berton,Didier A. Vega-Oliveros

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: natural language processing, transformed natural language, create accessibility barriers, Brazilian Portuguese, Brazilian Portuguese translation

Comments: 10 pages, 2 figures, PROPOR 2026

Abstract:Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.

54. 【2603.21404】Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

Link: https://arxiv.org/abs/2603.21404

Authors: Navya Mehrotra,Adam Visokay,Kristina Gligorić

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, annotate texts, language models, models are increasingly

Abstract:Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.

55. 【2603.21389】Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

Link: https://arxiv.org/abs/2603.21389

Authors: Jinghan Cao,Yu Ma,Xinjin Li,Qingyang Ren,Xiangyun Chen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, incur substantial computational, substantial computational costs, computational costs unsuitable, Large Language

Comments: Accepted for publication at ESANN 2025. This is a task-specific efficiency analysis comparing small language models

Abstract:Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
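
The abstract names the ingredients of PER (accuracy, throughput, memory, latency, geometric mean normalization) but not its exact formula. The sketch below is one plausible reading and not the paper's definition: each quantity is normalized against the best value in the comparison set, "lower is better" terms are inverted, and the geometric mean is taken. The function name, field names, and all numbers are assumptions made for illustration.

```python
import math

def per_scores(models):
    """Hypothetical PER: geometric mean of four normalized terms in (0, 1]."""
    best_acc = max(m["accuracy"] for m in models.values())
    best_tput = max(m["throughput"] for m in models.values())
    best_mem = min(m["memory_gb"] for m in models.values())
    best_lat = min(m["latency_ms"] for m in models.values())
    scores = {}
    for name, m in models.items():
        terms = [
            m["accuracy"] / best_acc,     # higher is better
            m["throughput"] / best_tput,  # higher is better
            best_mem / m["memory_gb"],    # lower is better, so invert
            best_lat / m["latency_ms"],   # lower is better, so invert
        ]
        scores[name] = math.prod(terms) ** (1 / len(terms))
    return scores

# Made-up measurements, for illustration only.
models = {
    "small-1B": {"accuracy": 0.80, "throughput": 400.0, "memory_gb": 2.0, "latency_ms": 20.0},
    "large-70B": {"accuracy": 0.90, "throughput": 40.0, "memory_gb": 140.0, "latency_ms": 200.0},
}
scores = per_scores(models)
print(scores)
```

Under this reading, a model that is slightly less accurate but far cheaper on the other three axes ends up with the higher score, which is the kind of trade-off the abstract describes.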

56. 【2603.21373】PLR: Plackett-Luce for Reordering In-Context Learning Examples

Link: https://arxiv.org/abs/2603.21373

Authors: Pawel Batorski, Paul Swoboda

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: adapts large language, avoiding costly parameter, large language models, ICL, adapts large

Abstract:In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the $n!$ possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett-Luce model. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for $k \in \{4, 8, 16, 32\}$ examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at this https URL.
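
The Gumbel perturb-and-sort step mentioned in the abstract is a standard way to sample from a Plackett-Luce distribution: add independent Gumbel(0, 1) noise to each item's log-score and sort in descending order. A minimal sketch follows. The function name and the toy scores are assumptions, and the parameter-update loop that PLR runs on top of the sampler is not shown.

```python
import math
import random

def sample_ordering(log_scores, rng):
    """Draw one ordering from a Plackett-Luce model via Gumbel perturb-and-sort."""
    perturbed = []
    for idx, score in enumerate(log_scores):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        perturbed.append((score + gumbel, idx))
    perturbed.sort(reverse=True)  # largest perturbed score goes first
    return [idx for _, idx in perturbed]

rng = random.Random(0)
log_scores = [2.0, 0.0, -2.0]  # toy log-weights for three in-context examples
draws = [sample_ordering(log_scores, rng) for _ in range(5000)]
first_is_zero = sum(d[0] == 0 for d in draws) / len(draws)
print(first_is_zero)
```

Under Plackett-Luce, item 0 should come first with probability exp(2) / (exp(2) + 1 + exp(-2)), roughly 0.87, so the printed frequency should land close to that value.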

57. 【2603.21368】Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

Link: https://arxiv.org/abs/2603.21368

Authors: Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani, Arianna Longo, Rossana Damiano, Lucie Flek, Marco Antonio Stranisci

Categories: Computation and Language (cs.CL)

Keywords: perceive political information, people perceive political, Conspiracy Frame, Conspiracy theories, social conflict

Abstract:Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (this http URL.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and this http URL. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to a clear increase in performance, it has potential; the mapping of annotated spans with FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.

58. 【2603.21365】TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Link: https://arxiv.org/abs/2603.21365

Authors: Jaber Jaber, Osama Jaber

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language models, Large language, language models run, Large, TIDE

Comments: 9 pages, 5 tables, 2 figures. Code: [this https URL](https://github.com/RightNow-AI/TIDE)

Abstract:Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: this https URL
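
The per-token exit rule can be pictured as "take the earliest checkpoint layer whose router says the hidden state has converged, otherwise run to the final layer". The sketch below illustrates only that selection logic. It is not TIDE's API: `exit_layer`, the score dictionaries, the 0.9 threshold, and the checkpoint at layer 23 are all hypothetical.

```python
def exit_layer(router_scores, threshold, final_layer):
    """Return the earliest checkpoint layer whose router score clears the threshold."""
    for layer in sorted(router_scores):
        if router_scores[layer] >= threshold:
            return layer
    return final_layer  # no checkpoint was confident enough

# Hypothetical per-token router scores at two checkpoint layers.
easy_token = {11: 0.97, 23: 0.99}
hard_token = {11: 0.40, 23: 0.62}
exits = [exit_layer(t, threshold=0.9, final_layer=31) for t in (easy_token, hard_token)]
print(exits)
```

In this toy setup the easy token leaves at the first checkpoint, while the hard token falls through to the last layer.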

59. 【2603.21362】AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Link: https://arxiv.org/abs/2603.21362

Authors: Liang Ding

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: demands Goal Alignment, debugging demands Correctness, navigation demands Goal, Error Handling, Action Efficiency

Abstract:LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff's $\alpha$=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering. Code: this https URL.

60. 【2603.21359】Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

Link: https://arxiv.org/abs/2603.21359

Authors: K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large language models, Large language, frequently exhibit performance, exhibit performance biases, frequently exhibit

Comments: 12 pages, 1 figure, 5 tables

Abstract:Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.

61. 【2603.21357】AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Link: https://arxiv.org/abs/2603.21357

Authors: Liang Ding

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM agents fail, Hindsight Experience Replay, WebArena navigation tasks, LLM agents, succeeds on fewer

Abstract:LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.

62. 【2603.21350】Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

Link: https://arxiv.org/abs/2603.21350

Authors: Adi Gabay, Gabriel Stanovsky, Liat Peterfreund

Categories: Computation and Language (cs.CL)

Keywords: reasoning requires agents, Epistemic reasoning requires, agents' knowledge, Epistemic reasoning, requires agents

Abstract:Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.

63. 【2603.21335】TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

Link: https://arxiv.org/abs/2603.21335

Authors: Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: clinical trial participation, cumulative healthcare contact, Time toxicity, trial participation, healthcare contact days

Comments: 19 pages, 5 figures, 7 tables

Abstract:Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.

64. 【2603.21321】Improving Coherence and Persistence in Agentic AI for System Optimization

Link: https://arxiv.org/abs/2603.21321

Authors: Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, Hari Balakrishnan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Designing high-performance system, iterative process requiring, process requiring experts, high-performance system heuristics, multi-step conceptual shifts

Abstract:Designing high-performance system heuristics is a creative, iterative process requiring experts to form hypotheses and execute multi-step conceptual shifts. While Large Language Models (LLMs) show promise in automating this loop, they struggle with complex system problems due to two critical failure modes: evolutionary neighborhood bias and the coherence ceiling. Evolutionary methods often remain trapped in local optima by relying on scalar benchmark scores, failing when coordinated multi-step changes are required. Conversely, existing agentic frameworks suffer from context degradation over long horizons or fail to accumulate knowledge across independent runs. We present Engram, an agentic researcher architecture that addresses these limitations by decoupling long-horizon exploration from the constraints of a single context window. Engram organizes exploration into a sequence of agents that iteratively design, test, and analyze mechanisms. At the conclusion of each run, an agent stores code snapshots, logs, and results in a persistent Archive and distills high-level modeling insights into a compact, persistent Research Digest. Subsequent agents then begin with a fresh context window, reading the Research Digest to build on prior discoveries. We find that Engram exhibits superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.

65. 【2603.21301】Enhancing reasoning accuracy in large language models during inference time

Link: https://arxiv.org/abs/2603.21301

Authors: Vinay Sharma, Manish Jain

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, exhibit strong linguistic, strong linguistic abilities, multi-step reasoning tasks

Abstract:Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of these three strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains: a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding. With minimal overhead, it is well suited to low-risk domains. The dual-model approach provides additional confirmation of the model's reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
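
Strategy (i), self-consistency, reduces to a majority vote over the final answers of several stochastic samples. A minimal sketch, assuming a caller-supplied `sample_fn` that returns one final answer per call: the function names are hypothetical, and the canned answers below stand in for real model samples drawn with temperature and nucleus sampling.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample n answers and return the most frequent one with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Stand-in for a stochastic model: replays canned final answers in order.
canned = iter(["42", "41", "42", "42", "7"])
answer, agreement = self_consistent_answer(lambda prompt: next(canned), "What is 6 * 7?")
print(answer, agreement)
```

The vote share can double as a rough confidence signal, for example to decide when to escalate to the dual-model check.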

66. 【2603.21298】More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection

Link: https://arxiv.org/abs/2603.21298

Authors: Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Combating hate speech, automated detection systems, Combating hate, securing cyberspace, social media

Abstract:Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: this https URL

67. 【2603.21278】Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations

Link: https://arxiv.org/abs/2603.21278

Authors: Pranav Hemanth, Sampriti Saha

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large language models, causing topically distinct, degrade response quality, topically distinct threads, progressively degrade response

Comments: 6 pages, 1 figure. Prototype available at [this https URL](https://the-conversation-tree.vercel.app/app)

Abstract:Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
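
To make the tree structure concrete, here is a rough sketch of context-isolated nodes with downstream flow on branch creation and selective upstream merge on deletion. It is not the authors' prototype. The `Node` class, the `inherit_last` policy, and the `keep` set are assumptions chosen for illustration, since the abstract leaves the context-flow policies as open design problems.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # identity-based equality keeps parent/child links simple
class Node:
    topic: str
    context: List[str] = field(default_factory=list)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    volatile: bool = False

    def branch(self, topic, inherit_last=1, volatile=False):
        """Create a child; pass only the last `inherit_last` messages downstream."""
        seed = self.context[-inherit_last:] if inherit_last > 0 else []
        child = Node(topic, list(seed), parent=self, volatile=volatile)
        self.children.append(child)
        return child

    def delete(self, keep=()):
        """Remove this branch; merge only the messages listed in `keep` upstream."""
        if self.parent is None:
            raise ValueError("cannot delete the root node")
        self.parent.context.extend(m for m in self.context if m in keep)
        self.parent.children.remove(self)

root = Node("trip")
root.context += ["plan a trip to Kyoto", "budget is 2000 EUR"]
side = root.branch("visa", inherit_last=1, volatile=True)
side.context += ["visa not required under 90 days", "unrelated digression"]
side.delete(keep={"visa not required under 90 days"})
print(root.context)
```

The digression never reaches the parent's context, which is the isolation property the abstract calls for.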

68. 【2603.21272】The Library Theorem: How External Organization Governs Agentic Reasoning Capacity

Link: https://arxiv.org/abs/2603.21272

Authors: Zachary F. Mainen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Keywords: Externalized reasoning, remains underexplored, exploited by transformer-based, reasoning state, transformer-based agents

Comments: 19 pages, 6 figures

Abstract:Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval -- indexing over one's own reasoning state -- remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $\Omega(N)$ page reads per query, and $O(T \log_b T)$ versus $\Theta(T^2)$ cumulative cost over $T$ reasoning steps -- a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types -- random hashes, ordered integers, and encyclopedia entries -- varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
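
The asymptotic gap can be checked with a small page-read counter. The sketch below is a simplified model and not the paper's benchmark: a sequential scan reads pages until it reaches the target, while a b-ary index reads one page per tree level. The function names and the worst-case setup (each step looks up the most recently stored item) are assumptions.

```python
def scan_reads(page_size, target):
    """Pages read by a sequential scan before reaching item `target` (0-based)."""
    return target // page_size + 1

def index_reads(n_items, branching):
    """Pages read when descending a b-ary index built over n_items entries."""
    reads, covered = 0, 1
    while covered < n_items:
        covered *= branching
        reads += 1
    return max(reads, 1)  # even a one-page store costs one read

def cumulative_reads(steps, page_size):
    """Total reads over `steps` lookups while the store grows by one item per step."""
    scan = sum(scan_reads(page_size, t - 1) for t in range(1, steps + 1))
    index = sum(index_reads(t, page_size) for t in range(1, steps + 1))
    return scan, index

print(scan_reads(50, 4999), index_reads(5000, 50))
print(cumulative_reads(1000, 10))
```

The per-query numbers show linear against logarithmic growth in the store size, and the cumulative totals show the quadratic against T log T behaviour over a growing reasoning trace.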

69. 【2603.21248】Graph Fusion Across Languages using Large Language Models

Link: https://arxiv.org/abs/2603.21248

Authors: Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: persistent challenge due, Large Language Models, Combining multiple knowledge, Combining multiple, linguistic boundaries

Abstract:Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
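
Structural linearization amounts to writing each triplet as a short text sequence that an LLM can read. A minimal sketch follows, in which the prompt wording, the function names, and the example triplets are all hypothetical. The abstract specifies only the head, relation, tail ordering.

```python
def linearize(triples):
    """Render (head, relation, tail) triples as one plain-text line each."""
    return "\n".join(f"{h} {r} {t}" for h, r, t in triples)

def fusion_prompt(fused, candidate):
    """Assemble a hypothetical alignment prompt from two linearized graphs."""
    return (
        "Fused graph so far:\n" + linearize(fused) + "\n\n"
        "New candidate graph:\n" + linearize(candidate) + "\n\n"
        "List entity pairs that refer to the same thing."
    )

fused = [("Paris", "capital_of", "France")]
candidate = [("Paris", "capitale_de", "France")]
prompt = fusion_prompt(fused, candidate)
print(prompt)
```

In a sequential fusion loop, the model's alignments would be applied to the fused graph before the next candidate graph is linearized.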

70. 【2603.21193】Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

Link: https://arxiv.org/abs/2603.21193

Authors: Sai Koneru, Jian Wu, Sarah Rajtmajer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)

Keywords: Extracting hypotheses, supporting statistical evidence, full-text scientific articles, remains difficult due, statistical evidence

Abstract:Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.

71. 【2603.21174】Explainable Semantic Textual Similarity via Dissimilar Span Detection

Link: https://arxiv.org/abs/2603.21174

Authors: Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Semantic Textual Similarity, Semantic Textual, Language Processing, Natural Language

Comments: Accepted at LREC 2026

Abstract:Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.

72. 【2603.21172】Entropy Alone is Insufficient for Safe Selective Prediction in LLMs

Link: https://arxiv.org/abs/2603.21172

Authors: Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton

Categories: Computation and Language (cs.CL)

Keywords: mitigate harms resulting, language model hallucinations, Selective prediction systems, Selective prediction, wider selective prediction

Abstract:Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk--coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
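
The risk-coverage trade-off can be illustrated with a toy selective predictor. In the sketch below, the linear mix of entropy and probe probability is an assumption and not the paper's combination rule, and the four data points are invented to show the confidently wrong, low-entropy case that an entropy-only score cannot catch.

```python
def combined_confidence(entropy, probe_p, max_entropy, weight=0.5):
    """Hypothetical mix: low entropy and high probe probability both raise confidence."""
    return weight * (1.0 - entropy / max_entropy) + (1.0 - weight) * probe_p

def risk_and_coverage(confidences, correct, threshold):
    """Answer only when confidence >= threshold; return (error rate on answered, coverage)."""
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return risk, coverage

# Invented examples: (entropy, probe probability of correctness, actually correct).
data = [(0.1, 0.9, True), (0.1, 0.2, False), (1.5, 0.8, True), (1.8, 0.1, False)]
correct = [c for _, _, c in data]
entropy_only = [combined_confidence(e, p, max_entropy=2.0, weight=1.0) for e, p, _ in data]
mixed = [combined_confidence(e, p, max_entropy=2.0, weight=0.5) for e, p, _ in data]
print(risk_and_coverage(entropy_only, correct, threshold=0.6))
print(risk_and_coverage(mixed, correct, threshold=0.6))
```

At the same threshold, the mixed score abstains on the low-entropy wrong answer, trading some coverage for lower risk.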

73. 【2603.21165】Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

Link: https://arxiv.org/abs/2603.21165

Authors: Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: everyday visual life, expressed through region, richly expressed, historically linked languages, Bengali culture

Comments: [this https URL](https://labib1610.github.io/BanglaVerse/)

Abstract:Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

74. 【2603.21096】Mixture of Chapters: Scaling Learnt Memory in Transformers

Link: https://arxiv.org/abs/2603.21096

Authors: Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: organizing knowledge acquired, explicit architectural mechanism, architectural mechanism, mechanism for storing, storing and organizing

Comments: 20 pages, 2 figures, 8 tables. Accepted at ICLR 2026 New Frontiers in Associative Memory Workshop. Code available at [this https URL](https://github.com/Tasmay-Tibrewal/Memory)

Abstract:Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
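
The chapter-based routing idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the router is a plain dot product against one hypothetical key per chapter, memory tokens serve as both keys and values, and all names (`route_and_read`, `router_keys`, `top_k`) are invented for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_read(query: Vector, router_keys: List[Vector],
                   chapters: List[List[Vector]], top_k: int = 1) -> Tuple[List[int], Vector]:
    """Pick the top_k chapters for this query, then attend only over their tokens."""
    chapter_scores = [dot(query, key) for key in router_keys]
    chosen = sorted(range(len(chapters)), key=lambda i: chapter_scores[i],
                    reverse=True)[:top_k]
    tokens = [tok for i in chosen for tok in chapters[i]]
    # Scaled dot-product attention restricted to the selected chapters.
    weights = softmax([dot(query, tok) / math.sqrt(len(query)) for tok in tokens])
    read = [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]
    return chosen, read


# Two chapters of two memory tokens each; the query matches chapter 0.
chapters = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
router_keys = [[1.0, 0.0], [0.0, 1.0]]
chosen, read = route_and_read([1.0, 0.0], router_keys, chapters, top_k=1)
```

The point of the routing step is that attention cost scales with the number of tokens in the selected chapters rather than with the size of the whole memory bank.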

75. 【2603.21094】Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol

Link: https://arxiv.org/abs/2603.21094

Authors: Smitha Muthya Sudheendra, Jaideep Srivastava

Categories: Computation and Language (cs.CL)

Keywords: exhibit substantial variability, central to NLP, NLP evaluation, human annotation behavior, Human annotation

Abstract:Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
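
As described in the abstract, the Annotator Effort Proxy is the proportion of labels revised after annotators see the model's reasoning. A minimal sketch of that computation follows; the function names and the simple percent-agreement helper are illustrative assumptions, and the paper may use a different agreement statistic.

```python
from typing import Sequence


def annotator_effort_proxy(first_pass: Sequence[str], second_pass: Sequence[str]) -> float:
    """Proportion of labels that changed between the two annotation passes."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must label the same instances")
    if not first_pass:
        return 0.0
    revised = sum(1 for before, after in zip(first_pass, second_pass) if before != after)
    return revised / len(first_pass)


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Raw agreement between two annotators (a stand-in for the paper's metric)."""
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Annotator 1 revises one of four labels after reading the explanations.
ann1_before = ["pos", "neg", "neg", "pos"]
ann1_after = ["pos", "neg", "pos", "pos"]
ann2 = ["pos", "neg", "pos", "pos"]

aep = annotator_effort_proxy(ann1_before, ann1_after)
agreement_before = percent_agreement(ann1_before, ann2)
agreement_after = percent_agreement(ann1_after, ann2)
```

In this toy case a single revision (AEP of 0.25) lifts agreement from 0.75 to 1.0, which mirrors the pattern the abstract reports: higher agreement with few revisions.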

76. 【2603.21084】ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks

Link: https://arxiv.org/abs/2603.21084

Authors: Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: High-quality text representations, face challenges due, High-quality text, Vietnamese face challenges, limited annotated data

Abstract:High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.

77. 【2603.21078】Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

Link: https://arxiv.org/abs/2603.21078

Authors: Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: local articulatory mechanisms, reflects local articulatory, evaluate neural TTS, neural TTS models', fine-grained segmental-prosodic effect

Comments: Accepted for publication in Computer Speech & Language

Abstract:This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.

78. 【2603.21065】LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Link: https://arxiv.org/abs/2603.21065

Authors: Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Native Formal Reasoning, advances Native Formal, open-source Mixture-of, agentic tool-integrated reasoning, formal reasoning task

Comments: 43 pages, 5 figures

Abstract:We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. We also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

79. 【2603.21038】Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding

Link: https://arxiv.org/abs/2603.21038

Authors: Taara Kumar, Kokil Jaidka

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: increasingly structures everyday, structures everyday interaction, central question re-emerges, text-based computer-mediated communication, reconstruct nonverbal expression

Comments: Accepted at AAAI ICWSM 2026

Abstract:As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at this https URL.

80. 【2603.21036】Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

Link: https://arxiv.org/abs/2603.21036

Authors: Abdul-Salem Beibitkhan

Categories: Computation and Language (cs.CL)

Keywords: investigate how large, Kazakh, Mongolian, large language models, language models perform

Abstract:We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.

81. 【2603.21022】Knowledge Boundary Discovery for Large Language Models

Link: https://arxiv.org/abs/2603.21022

Authors: Ziquan Wang, Zhongqi Lu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Knowledge Boundary Discovery, Language Models, Large Language, propose Knowledge Boundary

Comments: 9 pages, 4 figures

Abstract:We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM's responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM's response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.
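
The entropy-reduction reward mentioned in the abstract can be made concrete with a small sketch. It assumes, purely for illustration, that the agent's belief state is a categorical distribution (for example over "answerable" versus "unanswerable"); the abstract does not specify the belief representation, and the function names are hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a categorical distribution (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def entropy_reduction_reward(belief_before: Sequence[float],
                             belief_after: Sequence[float]) -> float:
    """Reward = how much uncertainty the latest question-answer round removed."""
    return shannon_entropy(belief_before) - shannon_entropy(belief_after)


# A fully uncertain belief that becomes certain earns one bit of reward.
full_reward = entropy_reduction_reward([0.5, 0.5], [1.0, 0.0])
# A question that leaves the belief unchanged earns nothing.
no_reward = entropy_reduction_reward([0.5, 0.5], [0.5, 0.5])
# A partially informative response earns something in between.
partial_reward = entropy_reduction_reward([0.5, 0.5], [0.9, 0.1])
```

Under this reading, the agent is pushed toward questions whose answers are most informative about where the model's knowledge ends.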

82. 【2603.21016】Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Link: https://arxiv.org/abs/2603.21016

Authors: Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, pairwise evaluation tasks, Large language, label symbols, Relative Policy Optimization

Comments: 16 pages, 3 figures, 5 tables

Abstract:Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (this https URL).
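
The two mechanisms named in the abstract can be sketched as follows. The cross-permutation advantage follows the abstract's wording (reward minus the mean over permutations of the same instance). The consistency reward is an assumption: here it is the share of permutations that agree with the majority decision, after decisions have been mapped back to option content rather than label symbols. Function names are hypothetical, and any additional normalization the paper applies is not shown.

```python
from collections import Counter
from statistics import mean
from typing import List, Sequence


def cross_permutation_advantages(rewards: Sequence[float]) -> List[float]:
    """Advantage of each permutation relative to the mean reward of its group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def consistency_reward(decisions: Sequence[str]) -> float:
    """Fraction of permutations agreeing with the majority decision (assumed form)."""
    majority_count = Counter(decisions).most_common(1)[0][1]
    return majority_count / len(decisions)


# One question presented under four option orderings.
rewards = [1.0, 0.0, 1.0, 1.0]           # 1.0 = correct under that ordering
decisions = ["Paris", "Lyon", "Paris", "Paris"]  # answers mapped back to content

advantages = cross_permutation_advantages(rewards)
consistency = consistency_reward(decisions)
```

The ordering that flipped the answer receives a strongly negative advantage, which is the signal that discourages position- or label-dependent choices.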

83. 【2603.21014】CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

Link: https://arxiv.org/abs/2603.21014

Authors: Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, represent and process, process information

Comments: 9 pages, 2 figures, code: [this https URL](https://github.com/LLM-Interp/CLT-Forge)

Abstract:Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: [this https URL](https://github.com/LLM-Interp/CLT-Forge).

84. 【2603.21006】How AI Systems Think About Education: Analyzing Latent Preference Patterns in Large Language Models

Link: https://arxiv.org/abs/2603.21006

Authors: Daniel Autenrieth

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Large Language Models, Large Language, Language Models, paper presents, systematic measurement

Comments: 15 pages, 2 figures, 8 tables. Code and data available at [this https URL](https://github.com/brianadvent/education-llm-spe-study). arXiv admin note: text overlap with [arXiv:2502.08640](https://arxiv.org/abs/2502.08640) by other authors

Abstract:This paper presents the first systematic measurement of educational alignment in Large Language Models. Using a Delphi-validated instrument comprising 48 items across eight educational-theoretical dimensions, the study reveals that GPT-5.1 exhibits highly coherent preference patterns (99.78% transitivity; 92.79% model accuracy) that largely align with humanistic educational principles where expert consensus exists. Crucially, divergences from expert opinion occur precisely in domains of normative disagreement among human experts themselves, particularly emotional dimensions and epistemic normativity. This raises a fundamental question for alignment research: When human values are contested, what should models be aligned to? The findings demonstrate that GPT-5.1 does not remain neutral in contested domains but adopts coherent positions, prioritizing emotional responsiveness and rejecting false balance. The methodology, combining Delphi consensus-building with Structured Preference Elicitation and Thurstonian Utility modeling, provides a replicable framework for domain-specific alignment evaluation beyond generic value benchmarks.

85. 【2603.20991】Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

Link: https://arxiv.org/abs/2603.20991

Authors: Abhinaba Basu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: revealing that transformer, orders of magnitude, single matrix, increase perplexity, spans five orders

Abstract:A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.

86. 【2603.20975】DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

Link: https://arxiv.org/abs/2603.20975

Authors: Bo Jiang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: independently answer questions, multiple prompted instances, language model independently, model independently answer, complex reasoning tasks

Abstract:Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement -- both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) -- to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
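
The calibration figures quoted in the abstract (ECE 0.036 vs. 0.098) refer to expected calibration error. A minimal sketch of the standard equal-width-bin estimator is shown below; the bin count and function name are assumptions, and the paper's exact binning scheme is not stated in the abstract.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin instead of overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Two predictions at 0.9 (both right) and two at 0.6 (one right).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

Each bin here is off by 0.1 in opposite directions (under-confident at 0.9, over-confident at 0.6), so the estimate comes out at roughly 0.1.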

87. 【2603.20969】Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge

Link: https://arxiv.org/abs/2603.20969

Authors: Bhavya Vasudeva, Puneesh Deora, Alberto Bietti, Vatsal Sharan, Christos Thrampoulidis

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Transformer-based language models, Transformer-based language, language models excel, in-context learning, parameter updates

Comments: 28 pages, 26 figures

Abstract:Transformer-based language models excel at in-context learning (ICL), where they can adapt to new tasks based on contextual examples, without parameter updates. In a specific form of ICL, which we refer to as *contextual recall*, models pretrained on open-ended text leverage pairwise examples to recall specific facts in novel prompt formats. We investigate whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations. For this, we introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples, with attribute types tied to grammar statistics. We demonstrate that while such pretraining successfully yields factual knowledge, it is insufficient for contextual recall: models fail to implicitly infer attribute types when the grammar statistics are removed in ICL prompts. However, we show that finetuning on tasks requiring implicit inference, distinct from the ICL evaluation, using a subset of subjects, triggers the emergence of contextual recall across all subjects. This transition is accompanied by the formation of low-dimensional latent encodings of the shared attribute type. For mechanistic insight, we derive a construction for an attention-only transformer that replicates the transition from factual to contextual recall, corroborated by empirical validation.

88. 【2603.20957】Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Link: https://arxiv.org/abs/2603.20957

Authors: Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Frontier LLM companies, Frontier LLM, LLM companies, repeatedly assured courts, companies have repeatedly

Comments: Preprint Under Review

Abstract:Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.

89. 【2603.20939】User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Link: https://arxiv.org/abs/2603.20939

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (stat.ML)

Keywords: repeatedly restate preferences, personal assistants, lack a persistent, repeatedly restate, Retrieval Scoring

Comments: 21 pages including appendices

Abstract:Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at this https URL.

90. 【2603.20907】The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs

Link: https://arxiv.org/abs/2603.20907

Authors: Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Sharifa Alghowinem, Hae Won Park, Maarten Sap, Cynthia Breazeal

Categories: Computation and Language (cs.CL)

Keywords: users increasingly turn, hidden incentives misaligned, personal advice, increasingly turn, subtly steered

Abstract:As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r=0.3 - 0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combatting, incentive-driven manipulation in LLMs during everyday, practical user queries.

91. 【2603.20899】Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach

Link: https://arxiv.org/abs/2603.20899

Authors: Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, genuine logical inference, language models exhibit, models exhibit strong, surface pattern matching

Comments: 12 pages, 2 figures. Preprint. Experiments on synthetic reasoning benchmarks. Code available

Abstract:Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: this https URL.
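To make the gradient-misalignment idea concrete, here is a minimal sketch of one common form of gradient surgery: when a sample's gradient conflicts with a validation gradient (negative dot product), the conflicting component is projected out. This is the projection popularized by PCGrad, used here only as a stand-in. The abstract does not say which surgery rule or ShortcutScore formula SART actually uses, and the function names below are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally sized gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(g_sample: List[float], g_val: List[float]) -> List[float]:
    """Remove the part of g_sample that points against g_val.

    Hypothetical helper (not from the paper): if the two gradients agree
    (dot product >= 0) the sample gradient is returned unchanged; otherwise
    its component along g_val is subtracted, leaving a vector orthogonal
    to g_val.
    """
    alignment = dot(g_sample, g_val)
    norm_sq = dot(g_val, g_val)
    if alignment >= 0 or norm_sq == 0.0:
        return list(g_sample)
    scale = alignment / norm_sq
    return [gs - scale * gv for gs, gv in zip(g_sample, g_val)]


# Toy example: the sample gradient opposes the validation gradient on axis 0.
g_val = [1.0, 0.0]
g_sample = [-2.0, 3.0]
g_fixed = project_out_conflict(g_sample, g_val)
print(g_fixed)  # [0.0, 3.0]
```

After the projection, the sample no longer pushes against the validation objective, while the component unrelated to the conflict (axis 1 in the toy example) is kept.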

92. 【2603.20895】LLM Router: Prefill is All You Need

Link: https://arxiv.org/abs/2603.20895

Authors: Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: comparable benchmark accuracies, navigating model-specific strengths, share comparable benchmark, task subsets suggests, significantly surpass standalone

Abstract:LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router--a theoretical selector with perfect foresight--can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling--a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
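The "share of the accuracy gap captured" figure can be read as a simple ratio: how far a router moves from the best standalone model toward the Oracle. The sketch below works through that arithmetic on toy data. The formula is an assumed reading of the abstract rather than the paper's stated definition, and the model names, correctness table, and routing choices are all invented.

```python
from typing import Dict, List


def best_standalone_accuracy(correct: Dict[str, List[bool]]) -> float:
    """Accuracy of the single strongest model on the query set."""
    return max(sum(v) / len(v) for v in correct.values())


def oracle_accuracy(correct: Dict[str, List[bool]]) -> float:
    """Accuracy of a perfect router: a query counts if any model solves it."""
    hits = [any(row) for row in zip(*correct.values())]
    return sum(hits) / len(hits)


def routed_accuracy(correct: Dict[str, List[bool]], choices: List[str]) -> float:
    """Accuracy obtained when query i is sent to the model named choices[i]."""
    hits = [correct[model][i] for i, model in enumerate(choices)]
    return sum(hits) / len(hits)


def gap_captured(router_acc: float, best_acc: float, oracle_acc: float) -> float:
    """Fraction of the (oracle - best standalone) gap recovered by the router."""
    if oracle_acc == best_acc:
        raise ValueError("no gap between the best model and the oracle")
    return (router_acc - best_acc) / (oracle_acc - best_acc)


# Invented per-query correctness for two hypothetical models on 8 queries.
correct = {
    "model_a": [True, True, False, False, True, False, False, False],
    "model_b": [False, True, True, False, False, True, True, False],
}
choices = ["model_a", "model_a", "model_b", "model_a",
           "model_b", "model_b", "model_b", "model_a"]

best = best_standalone_accuracy(correct)      # 0.5
oracle = oracle_accuracy(correct)             # 0.75
router = routed_accuracy(correct, choices)    # 0.625
print(gap_captured(router, best, oracle))     # 0.5
```

In this toy case the router recovers half of the gap. Under the same reading, the 45.58% reported in the abstract would mean the router closes a little under half of the distance to the Oracle.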

93. 【2603.20884】NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation

Link: https://arxiv.org/abs/2603.20884

Authors: Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang

Categories: Computation and Language (cs.CL)

Keywords: varying quality, increasing the cost, exponential growth, growth of academic, academic publications

Abstract:The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper's originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at this https URL.

94. 【2603.20882】RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation

Link: https://arxiv.org/abs/2603.20882

Authors: Kaustubh D. Dhole, Eugene Agichtein

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, Large language, output scalar scores, increasingly evaluated, output scalar

Abstract:Large language models (LLMs) are increasingly evaluated and sometimes trained using automated graders such as LLM-as-judges that output scalar scores or preferences. While convenient, these approaches are often opaque: a single score rarely explains why an answer is good or bad, which requirements were missed, or how a system should be improved. This lack of interpretability limits their usefulness for model development, dataset curation, and high-stakes deployment. Query-specific rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding and not feasible for deployment. While previous approaches have focused on generating intermediate rubrics for automated downstream evaluation, it is unclear if these rubrics are both interpretable and effective for human users. In this work, we investigate whether LLMs can generate useful, instance-specific rubrics as compared to human-authored rubrics, while also improving effectiveness for identifying good responses. Through our systematic study on two rubric benchmarks, and on multiple few-shot and post-training strategies, we find that off-the-shelf LLMs produce rubrics that are poorly aligned with human-authored ones. We introduce a simple strategy, RubricRAG, which retrieves domain knowledge via rubrics at inference time from related queries. We demonstrate that RubricRAG can generate more interpretable rubrics both for similarity to human-authored rubrics, and for improved downstream evaluation effectiveness. Our results highlight both the challenges and a promising approach of scalable, interpretable evaluation through automated rubric generation.

95. 【2603.20867】Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces

Link: https://arxiv.org/abs/2603.20867

Authors: Hossein Javidnia

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: Recent interpretability work, single global direction, latent coordinate shared, Recent interpretability, dictionary atom

Comments: 20 pages, 2 figures

Abstract:Recent interpretability work often treats a feature as a single global direction, dictionary atom, or latent coordinate shared across contexts. We argue that this ontology can fail in obstructed representation spaces, where locally coherent meanings need not assemble into one globally consistent feature. We introduce an atlas-native replacement object, the semantic section: a transport-compatible family of local feature representatives defined over a context atlas. We formalize semantic sections, prove that tree-supported propagation is always pathwise realizable, and show that cycle consistency is the key criterion for genuine globalization. This yields a distinction between tree-local, globalizable, and twisted sections, with twisted sections capturing locally coherent but holonomy-obstructed meanings. We then develop a discovery-and-certification pipeline based on seeded propagation, synchronization across overlaps, defect-based pruning, cycle-aware taxonomy, and deduplication. Across layer-16 atlases for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, and Gemma 2 2B IT, we find nontrivial populations of semantic sections, including cycle-supported globalizable and twisted regimes after deduplication. Most importantly, semantic identity is not recovered by raw global-vector similarity. Even certified globalizable sections show low cross-chart signed cosine similarity, and raw similarity baselines recover only a small fraction of true within-section pairs, often collapsing at moderate thresholds. By contrast, section-based identity recovery is perfect on certified supports. These results support semantic sections as a better feature ontology in obstructed regimes.

96. 【2603.20854】SozKZ: Training Efficient Small Language Models for Kazakh from Scratch

Link: https://arxiv.org/abs/2603.20854

Authors: Saken Tukenov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Turkic language spoken, allocate minimal capacity, Turkic language, employ tokenizers ill-suited, million people

Comments: 12 pages, 3 figures, 2 tables

Abstract:Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks -- multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) -- alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.

97. 【2603.20851】Can ChatGPT Really Understand Modern Chinese Poetry?

Link: https://arxiv.org/abs/2603.20851

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: demonstrated remarkable capabilities, poetry remains unexplored, generation and translation, remains unexplored, demonstrated remarkable

Comments: Accepted by EACL 2026

Abstract:ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT's understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT's interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT's interpretations align with the original poets' intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT's ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.

98. 【2603.20843】HiCI: Hierarchical Construction-Integration for Long-Context Attention

Link: https://arxiv.org/abs/2603.20843

Authors: Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: remains largely implicit, information structuring remains, structuring remains largely, existing approaches, commonly framed

Comments: 18 pages, 5 figures

Abstract:Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction--Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with only 5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.

99. 【2603.20807】BenchBench: Benchmarking Automated Benchmark Generation

Link: https://arxiv.org/abs/2603.20807

Authors: Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

Categories: Computation and Language (cs.CL)

Keywords: static test sets, rapidly saturate, vulnerable to contamination, costly to refresh, facto standard

Abstract:Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer--answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model--item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho ~0.37), invalidity is negatively associated with discrimination (Pearson r~0.62), and the resulting designer--answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: this https URL.

100. 【2603.20799】RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

Link: https://arxiv.org/abs/2603.20799

Authors: Kaiyuan Li, Jing-Cheng Pang, Yang Yu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, Reinforcement learning, GQA, language models, substantially enhancing

Abstract:Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.

101. 【2603.20795】The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing

Link: https://arxiv.org/abs/2603.20795

Authors: Yuan Cao, Mingyang Wang, Hinrich Schütze

Categories: Computation and Language (cs.CL)

Keywords: Large language models, date requires targeted, Large language, requires targeted knowledge, date requires

Abstract:Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution -- contrasting successful and failed edits -- to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.

102. 【2603.20781】Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

Link: https://arxiv.org/abs/2603.20781

Authors: Jiang Liu, Ge Qiu, Hao Fei, Dongdong Xie, Jinbo Li, Fei Li, Chong Teng, Donghong Ji

Categories: Computation and Language (cs.CL)

Keywords: multimodal information extraction, information extraction based, information extraction, rapid development, development of large

Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, there are still some spaces to improve in the existing related methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatch with the characteristics of information tasks that mostly include structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they just explored their methods on text-only IE rather than multimodal IE. Moreover, their methods are more complex in design, requiring separate templates to be designed for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender, affiliation are extracted from the text to guide the model to understand the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs and raw text compose of the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results such as entities, relations, etc. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
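Design (3) in the abstract describes the input as a Python function whose parameters carry the raw text, entity attributes, and scene graph, with the output formalized as a Python dictionary. The sketch below shows what such a code-style prompt might look like. The function name `extract_information`, the field names, the sample sentence, and the sample completion are all invented for illustration and are not taken from the paper.

```python
import ast
from typing import Dict, List, Tuple


def build_code_prompt(text: str,
                      entity_attributes: Dict[str, Dict[str, str]],
                      scene_graph: List[Tuple[str, str, str]]) -> str:
    """Render a code-style prompt: a Python function stub for an LLM to complete."""
    return (
        "def extract_information(\n"
        f"    text={text!r},\n"
        f"    entity_attributes={entity_attributes!r},\n"
        f"    scene_graph={scene_graph!r},\n"
        "):\n"
        '    """Return all entities and relations as a dictionary."""\n'
        "    result = "
    )


def parse_code_output(completion: str) -> dict:
    """Parse the model's dictionary literal without executing arbitrary code."""
    return ast.literal_eval(completion.strip())


# Hypothetical inputs: one sentence, attributes for one entity, one scene-graph triple.
prompt = build_code_prompt(
    text="Alice joined Acme in 2020.",
    entity_attributes={"Alice": {"gender": "female", "affiliation": "Acme"}},
    scene_graph=[("woman", "standing near", "building")],
)

# Hypothetical completion, standing in for what an LLM might return.
completion = (
    '{"entities": [("Alice", "PER"), ("Acme", "ORG")], '
    '"relations": [("Alice", "works_for", "Acme")]}'
)
parsed = parse_code_output(completion)
print(prompt)
print(parsed["relations"])
```

Parsing with `ast.literal_eval` instead of `eval` means only plain Python literals are accepted, which is a reasonable safeguard when the dictionary comes from model output.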

103. 【2603.20732】MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages

Link: https://arxiv.org/abs/2603.20732

Authors: Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, Elan Novick, Francois Meyer, Jan Buys

Categories: Computation and Language (cs.CL)

Keywords: adapted to diverse, task-specific finetuning, monolingual task-specific finetuning, South Africa, finetuning

Comments: 15 pages, 11 tables, appendix included. Accepted at LREC 2026

Abstract:Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.

104. 【2603.20730】Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks

Link: https://arxiv.org/abs/2603.20730

Authors: Fan Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Existing prompting paradigms, produces linear traces, paradigms structure LLM, performs branching search, Existing prompting

Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14--18 percentage point gap on HotpotQA).

105. 【2603.20704】NDT: Non-Differential Transformer and Its Application to Sentiment Analysis

Link: https://arxiv.org/abs/2603.20704

Authors: Soudeep Ghoshal, Himanshu Buckchash, Sarita Paudel, Rubén Ruiz-Torrubiano

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: understanding human sentiment, social media, understanding human, meaningfully with people, customer feedback

Comments: 10 pages, 16 figures. Submitted to IEEE Transactions on Computational Social Systems

Abstract:From customer feedback to social media, understanding human sentiment in text is central to how machines can interact meaningfully with people. However, despite notable progress, accurately capturing sentiment remains a challenging task, which continues to motivate further research in this area. To this end, we introduce Non-Differential Transformer (NDT). It is inspired by (but in contrast to) the state-of-the-art Differential Transformer (DT) model. While standard Transformers can struggle with irrelevant context, the sota DT model uses attention map subtraction, potentially for noise cancellation. We explore an alternative motivation, hypothesizing that benefits may arise from enabling different attention components to specialize on distinct concepts within the text, similar to multiplexing information channels or mixture models, rather than primarily canceling noise via subtraction. Guided by this concept-multiplexing (ConPlex) view, the specific architecture presented in this paper employs a purely additive strategy. It uses only positive weights, learned during training, to ensure constructive combination of these specialized attention perspectives. This design choice explores positive only integration, though our broader framework also shows promise with less constrained linear combinations involving both positive and negative weights. Our model computes attention via this positively weighted sum of multiple distinct attention maps. This allows the model to constructively integrate diverse signals and potentially capture more complex contextual relationships. Competitive performance is achieved by the proposed model for Sentiment Analysis while tested on multiple datasets. We conclude by presenting our results, challenges and future research agenda in this important area of research.
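The contrast with the Differential Transformer can be sketched in a few lines: instead of subtracting one softmax attention map from another, the abstract describes a positively weighted sum of several maps. In the sketch below, positivity is enforced by passing unconstrained parameters through a softplus. That is an assumption made for illustration, since the abstract only says the weights are positive and learned. The toy score matrices and the function names are likewise invented.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax_rows(scores: Matrix) -> Matrix:
    """Row-wise softmax turning raw attention scores into attention maps."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def softplus(x: float) -> float:
    """Smooth map from any real number to a strictly positive one."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def combine_positive(maps: List[Matrix], raw_weights: List[float]) -> Matrix:
    """Positively weighted sum of attention maps (purely additive, no subtraction)."""
    weights = [softplus(w) for w in raw_weights]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) for j in range(cols)]
        for i in range(rows)
    ]


# Two toy score matrices standing in for two attention components.
scores_a = [[2.0, 0.0], [0.0, 1.0]]
scores_b = [[0.0, 3.0], [1.0, 1.0]]
maps = [softmax_rows(scores_a), softmax_rows(scores_b)]
raw_weights = [0.5, -1.0]  # unconstrained parameters; softplus makes them positive

combined = combine_positive(maps, raw_weights)
print(combined)
```

Because every weight and every map entry is positive, no entry of the combined map can go negative, unlike the subtraction used in the Differential Transformer. Each row sums to the total of the weights rather than to 1, so a real model would need normalization or a downstream layer that tolerates the rescaling.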

106. 【2603.20698】Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Link: https://arxiv.org/abs/2603.20698

Authors: Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated remarkable potential

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

107. 【2603.20695】Can I guess where you are from? Modeling dialectal morphosyntactic similarities in Brazilian Portuguese

Link: https://arxiv.org/abs/2603.20695

Authors: Manoel Siqueira, Raquel Freitag

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Brazilian Portuguese, paper investigates morphosyntactic, investigates morphosyntactic covariation, linguistic variables, paper investigates

Comments: 17th International Conference on Computational Processing of Portuguese - PROPOR

Abstract:This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistics and computational approaches, the study highlights the importance of interdisciplinary research. Developing fair and inclusive language technologies that respect dialectal diversity outweighs the challenges of integrating these fields.

108. 【2603.20673】PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs

Link: https://arxiv.org/abs/2603.20673

Authors: Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-augmented language models, retrieve relevant evidence, language models, models can retrieve, retrieve relevant

Abstract:Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.

109. 【2603.20642】Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models

Link: https://arxiv.org/abs/2603.20642

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: transformer language models, language models represent, transformer language, geometry, models represent magnitude

Comments: 18 pages, 7 figures, 5 tables. Pre-registered on OSF. Submitted to TMLR

Abstract:How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.

110. 【2603.20640】Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention

Link: https://arxiv.org/abs/2603.20640

Authors: Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le

Categories: Computation and Language (cs.CL)

Keywords: iterative inter-agent communication, large language models, inter-agent communication, large language, language models

Abstract:Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
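
The retention step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes each agent response has already been reduced to a final answer string, it scores disagreement as plain answer inequality, and the function name `select_diverse` is hypothetical. It returns indices rather than rewritten text, mirroring the index-based retention the abstract mentions.

```python
from collections import Counter
from itertools import combinations


def select_diverse(answers, k):
    """Return indices of the k answers that disagree most (illustrative only).

    Assumption: two answers disagree (score 1) if they differ, and each
    selected answer earns 1 extra point for differing from the majority vote.
    The paper's actual scoring may be different.
    """
    majority = Counter(answers).most_common(1)[0][0]

    def score(subset):
        pairwise = sum(answers[i] != answers[j] for i, j in combinations(subset, 2))
        vs_majority = sum(answers[i] != majority for i in subset)
        return pairwise + vs_majority

    # Exhaustive search is acceptable for the small agent counts typical of debates.
    return max(combinations(range(len(answers)), k), key=score)


answers = ["12", "12", "15", "12", "9"]
kept = select_diverse(answers, k=2)
broadcast = [answers[i] for i in kept]  # original messages, unmodified
```

Here the two minority answers are retained because they differ from each other and from the majority answer "12".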

111. 【2603.20636】A Modular LLM Framework for Explainable Price Outlier Detection

Link: https://arxiv.org/abs/2603.20636

Authors: Shadi Sartipi, John Wu, Sina Ghotbi, Nikhita Vedula, Shervin Malmasi

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)

Keywords: adversely affect competitiveness, Detecting product price, unexpectedly high prices, high prices adversely, prices adversely affect

Comments: 13 pages, 3 figures

Abstract:Detecting product price outliers is important for retail and e-commerce stores as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the sensitivity of the method to key hyper-parameters and attest to its flexibility across cases with different accuracy requirements and levels of auditor agreement.

112. 【2603.20581】JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Link: https://arxiv.org/abs/2603.20581

Authors: Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi, Naoaki Okazaki

Categories: Computation and Language (cs.CL)

Keywords: Social biases reflected, Japanese, JUBAKU, inherently shaped, vary significantly

Abstract:Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and three others adapted from English benchmarks. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU's reliability and its adversarial nature to LLMs.

113. 【2603.20562】Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Link: https://arxiv.org/abs/2603.20562

Authors: Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, language models, change under presentation, presentation choices

Abstract:Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
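
The core idea of permutation consensus can be sketched as below. This is only an illustration under stated assumptions, not PCFJudge itself: the `judge` callable is hypothetical (it stands in for an LLM call that returns one score per presented candidate), aggregation is a plain mean, and the abstract's rank and uncertainty signals are omitted. The toy judge adds a bonus to whichever candidate is shown first, to mimic order sensitivity.

```python
import itertools
import random
from statistics import mean


def consensus_scores(candidates, judge, num_orderings=6, seed=0):
    """Average per-candidate scores over several presentation orders.

    `judge` is a hypothetical callable: it receives the candidates in
    presentation order and returns one score per position.
    """
    rng = random.Random(seed)
    # Enumerating all permutations only suits small candidate sets;
    # for larger sets one would sample orderings instead.
    orderings = list(itertools.permutations(range(len(candidates))))
    rng.shuffle(orderings)
    collected = {i: [] for i in range(len(candidates))}
    for order in orderings[:num_orderings]:
        scores = judge([candidates[i] for i in order])
        for position, original_index in enumerate(order):
            collected[original_index].append(scores[position])
    return [mean(collected[i]) for i in range(len(candidates))]


def position_biased_judge(ordered):
    """Toy stand-in for an LLM judge: true quality plus a first-position bonus."""
    quality = {"A": 0.9, "B": 0.6, "C": 0.3}
    return [quality[c] + (0.7 if pos == 0 else 0.0) for pos, c in enumerate(ordered)]


single_pass = position_biased_judge(["C", "B", "A"])  # C is shown first
scores = consensus_scores(["A", "B", "C"], position_biased_judge, num_orderings=6)
best = max(range(3), key=lambda i: scores[i])
```

In the single pass the weakest candidate wins because it is shown first; averaged over all six orderings, the position bonus is spread evenly and candidate A ranks first again.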

114. 【2603.20533】Revenue-Sharing as Infrastructure: A Distributed Business Model for Generative AI Platforms

Link: https://arxiv.org/abs/2603.20533

Authors: Ghislain Dorian Tchuente Mondjo

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Google AI Studio, application development ecosystem, provide infrastructures, Anthropic, Generative

Comments: 11 pages, 1 figure, 2 tables

Abstract:Generative AI platforms (Google AI Studio, OpenAI, Anthropic) provide infrastructures (APIs, models) that are transforming the application development ecosystem. Recent literature distinguishes three generations of business models: a first generation modeled on cloud computing (pay-per-use), a second characterized by diversification (freemium, subscriptions), and a third, emerging generation exploring multi-layer market architectures with revenue-sharing mechanisms. Despite these advances, current models impose a financial barrier to entry for developers, limiting innovation and excluding actors from emerging economies. This paper proposes and analyzes an original model, "Revenue-Sharing as Infrastructure" (RSI), where the platform offers its AI infrastructure for free and takes a percentage of the revenues generated by developers' applications. This model reverses the traditional upstream payment logic and mobilizes concepts of value co-creation, incentive mechanisms, and multi-layer market architecture to build an original theoretical framework. A detailed comparative analysis shows that the RSI model lowers entry barriers for developers, aligns stakeholder interests, and could stimulate innovation in the ecosystem. Beyond its economic relevance, RSI has a major societal dimension: by enabling developers without initial capital to participate in the digital economy, it could unlock the "latent jobs dividend" in low-income countries, where mobile penetration reaches 84%, and help address local challenges in health, agriculture, and services. Finally, we discuss the conditions of feasibility and strategic implications for platforms and developers.

115. 【2603.20531】Epistemic Observability in Language Models

Link: https://arxiv.org/abs/2603.20531

Authors: Tony Mason

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: report highest confidence, highest confidence precisely, models report highest, report highest, highest confidence

Abstract:We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5 to 3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman ρ = 0.762). The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.
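
For readers unfamiliar with the signal named in this abstract, per-token entropy is the Shannon entropy of the model's next-token distribution at each generated position. The sketch below is a minimal illustration, not the paper's tensor interface: the distributions are made-up four-way toy vectors, and averaging over positions is an assumption, since the abstract does not say how per-token values are pooled.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_probs):
    """Average entropy over all generated positions (pooling choice is an assumption)."""
    return sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)


# Toy distributions for illustration only; real vocabularies have many thousands of entries.
peaked = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
flat = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

peaked_score = mean_entropy(peaked)
flat_score = mean_entropy(flat)
```

A flatter distribution yields higher entropy, so `flat_score` exceeds `peaked_score`; a uniform four-way distribution gives exactly ln 4.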

116. 【2603.20514】Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study

Link: https://arxiv.org/abs/2603.20514

Authors: Mohammed Rakibul Hasan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: offer significant potential, Large Language Models, delivering health information, Large Language, offer significant

Comments: 20 pages, 7 figures, 3 tables

Abstract:Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question-answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.

117. 【2603.20508】Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?

Link: https://arxiv.org/abs/2603.20508

Authors: Dani Roytburg, Shreya Sridhar, Daphne Ippolito

Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: answering users' queries, Reasoning Language Models, Language models, reasoning traces, users' queries

Abstract:Language models are increasingly being trained to "reason" before answering users' queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models' ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM's reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM's ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.

118. 【2603.20494】PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

Link: https://arxiv.org/abs/2603.20494

Authors: Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry

Categories: Computation and Language (cs.CL)

Keywords: broader European Union, European Union, stringent privacy regulations, restricts data sharing, broader European

Abstract:The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.

119. 【2603.20492】AE-LLM: Adaptive Efficiency Optimization for Large Language Models

Link: https://arxiv.org/abs/2603.20492

Authors: Kaito Tanaka, Masato Ito, Yuji Nishimura, Keisuke Matsuda, Aya Nakayama

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, substantial computational costs, achieved remarkable success, remains challenging due

Abstract:Large Language Models (LLMs) have achieved remarkable success across diverse applications, yet their deployment remains challenging due to substantial computational costs, memory requirements, and energy consumption. Recent empirical studies have demonstrated that no single efficiency technique is universally optimal; instead, the effectiveness of methods such as efficient attention mechanisms, mixture-of-experts (MoE), parameter-efficient fine-tuning, and quantization varies significantly depending on task characteristics, resource constraints, and model scales. Building upon these insights, we propose AE-LLM, a unified framework that automatically selects and combines optimal efficiency techniques tailored to specific deployment scenarios. Our approach introduces a multi-objective optimization framework that jointly considers accuracy, latency, memory footprint, and energy consumption, while accounting for hardware constraints and task requirements. We develop an efficient search algorithm that explores the combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages, identifying Pareto-optimal configurations. Extensive experiments across 15 models (0.5B-70B parameters) and 10 diverse tasks demonstrate that AE-LLM achieves an average of 2.8× improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline), compared to static efficiency configurations. Furthermore, our framework generalizes effectively to vision-language models, achieving similar efficiency gains. Our contributions provide practitioners with an automated tool for navigating the complex trade-off landscape of LLM efficiency optimization.
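
The notion of a Pareto-optimal configuration used in this abstract can be illustrated with a small dominance filter. This is a generic sketch, not AE-LLM's search algorithm: the configuration names and numbers are invented for illustration, energy is left out for brevity, and the brute-force comparison only suits small candidate sets.

```python
def pareto_front(configs):
    """Keep configurations that no other configuration dominates.

    Each config is (name, accuracy, latency_ms, memory_gb): higher accuracy is
    better, lower latency and lower memory are better.
    """
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]
        strictly_better = a[1] > b[1] or a[2] < b[2] or a[3] < b[3]
        return no_worse and strictly_better

    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers for illustration only; these are not results from the paper.
candidates = [
    ("baseline", 0.80, 120.0, 16.0),
    ("int8", 0.79, 70.0, 9.0),
    ("int4", 0.75, 55.0, 6.0),
    ("int8_slow", 0.78, 90.0, 9.0),
]
front = [c[0] for c in pareto_front(candidates)]
```

Only `int8_slow` is dropped, because `int8` is at least as good on every axis and strictly better on accuracy and latency; the remaining three represent different trade-offs.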

120. 【2603.20479】Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

Link: https://arxiv.org/abs/2603.20479

Authors: Robert Godwin-Jones

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: highly emotional process, big and small, language learning, typically characterized, frustrations and triumphs

Abstract:Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users' affective signals, has been seen as enabling greater personalized learning, adapting to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling.

121. 【2603.20466】Diffutron: A Masked Diffusion Language Model for Turkish Language

Link: https://arxiv.org/abs/2603.20466

Authors: Şuayp Talha Kocabay, Talha Rüzgar Akkuş

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Masked Diffusion Language, languages remains limited, standard large language, morphologically rich languages, rich languages remains

Abstract:Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce Diffutron, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.

122. 【2603.20450】Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

Link: https://arxiv.org/abs/2603.20450

Authors: Rounak Saha, Gurusha Juneja, Dayita Chaudhuri, Naveeja Sajeevan, Nihar B Shah, Danish Pruthi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: prohibit LLM usage, recently enacted policies, prohibit LLM, LLM usage, conferences and journals

Abstract:A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.

123. 【2603.20441】A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement

Link: https://arxiv.org/abs/2603.20441

Authors: Yuran Li, Di Wu, Benoit Boulet

Categories: Computation and Language (cs.CL)

Keywords: Verification-guided self-improvement, large language model, self-improvement has recently, recently emerged, promising approach

Comments: 18 pages, 5 figures

Abstract:Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experiment results show that our method outperforms prior methods while maintaining low computational cost.

124. 【2603.20433】ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models' In-Context Learning Ability

Link: https://arxiv.org/abs/2603.20433

Authors: Yen-Ting Piao, Jay Chiehen Liao, Wei-Tang Chien, Toshiki Ogimoto, Shang-Tse Chen, Yun-Nung Chen, Chun-Yi Lee, Shao-Yuan Lo

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Large Audio-Language Models, degraded instruction-following capabilities, exhibit degraded instruction-following, conditioning remains unstudied, Audio-Language Models

Comments: Submitted to Interspeech 2026

Abstract:While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.

125. 【2603.20432】Coding Agents are Effective Long-Context Processors

Link: https://arxiv.org/abs/2603.20432

Authors: Weili Cao, Xunjian Yin, Bhuwan Dhingra, Shuyan Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable progress, demonstrated remarkable

Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, the access is via the latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.

126. 【2603.20405】Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

Link: https://arxiv.org/abs/2603.20405

Authors: Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Putnam Mathematical Competition, Model Context Protocol, Context Protocol, Putnam Mathematical, Mathematical Competition

Abstract:We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

127. 【2603.20381】The production of meaning in the processing of natural language

Link: https://arxiv.org/abs/2603.20381

Authors: Christopher J. Agostino, Quan Le Thien, Nayan D'Souza, Louis van der Elst

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: empowering human-agent interactions, fundamental mechanisms governing, designing safe, human-agent interactions, governing the production

Comments: Submitted to HAXD 2026, 9 pages, 3 figures, 2 tables. Associated package available at [this https URL](https://github.com/npc-worldwide/qstk)

Click to view abstract

Abstract:Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models -- in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH $|S|$ parameter -- the metric associated with the inequality -- across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the $|S|$ distribution -- the statistic that most sharply differentiates models from one another -- is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how $|S|$ varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale -- manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
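
For readers unfamiliar with the metric, the sketch below shows how a CHSH value is typically computed from four correlation terms and compared against the classical bound of 2. It assumes the standard CHSH form S = E(a,b) - E(a,b') + E(a',b) + E(a',b') with outcomes coded as +1 or -1; the function names and sample values are illustrative and are not taken from the paper or its associated package.

```python
from math import sqrt

# Standard bounds on |S|: 2 for classical (local hidden-variable) models,
# 2*sqrt(2) (Tsirelson's bound) for quantum correlations.
CLASSICAL_BOUND = 2.0
TSIRELSON_BOUND = 2 * sqrt(2)


def correlation(pairs):
    """Mean product of paired outcomes, each outcome being +1 or -1."""
    return sum(x * y for x, y in pairs) / len(pairs)


def chsh_s(e_ab, e_ab2, e_a2b, e_a2b2):
    """CHSH combination of the four correlations (standard sign convention)."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


def violates_bell(s):
    """True when |S| exceeds the classical bound of 2."""
    return abs(s) > CLASSICAL_BOUND


# Hypothetical correlations for the four context pairings.
v = sqrt(2) / 2
s_value = chsh_s(v, -v, v, v)
print(round(s_value, 3), violates_bell(s_value))
```

A violation rate, as mentioned in the abstract, would then be the fraction of trials for which `violates_bell` returns true; how the paper groups trials is not specified here.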

128. 【2603.20311】kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation

Link: https://arxiv.org/abs/2603.20311

Authors: Rohan Siva, Kai Cheung, Lichi Li, Ganesh Sundaram

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Modern machine learning, Modern machine, machine learning systems, learning systems rely, machine learning

Comments: 9 pages, 7 figures

Click to view abstract

Abstract:Modern machine learning systems rely on complex data engineering workflows to extract, transform, and load (ELT) data into production pipelines. However, constructing these pipelines remains time-consuming and requires substantial expertise in data infrastructure and orchestration frameworks. Recent advances in large language model (LLM) agents offer a potential path toward automating these workflows, but existing approaches struggle with under-specified user intent, unreliable tool generation, and limited guarantees of executable outputs. We introduce kRAIG, an AI agent that translates natural language specifications into production-ready Kubeflow Pipelines (KFP). To resolve ambiguity in user intent, we propose ReQuesAct (Reason, Question, Act), an interaction framework that explicitly clarifies intent prior to pipeline synthesis. The system orchestrates end-to-end data movement from diverse sources and generates task-specific transformation components through a retrieval-augmented tool synthesis process. To ensure data quality and safety, kRAIG incorporates LLM-based validation stages that verify pipeline integrity prior to execution. Our framework achieves a 3x improvement in extraction and loading success and a 25 percent increase in transformation accuracy compared to state-of-the-art agentic baselines. These improvements demonstrate that structured agent workflows with explicit intent clarification and validation significantly enhance the reliability and executability of automated data engineering pipelines.

129. 【2603.20278】OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

Link: https://arxiv.org/abs/2603.20278

Authors: Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Training deep research, evidence aggregation, multi-step reasoning, Training deep, agents requires long-horizon

Comments:

Click to view abstract

Abstract:Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at this https URL.

130. 【2603.20256】SciNav: A General Agent Framework for Scientific Coding Tasks

Link: https://arxiv.org/abs/2603.20256

Authors: Tianshu Zhang, Huan Sun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Keywords: Autonomous science agents, large language models, Autonomous science, Scientific coding, generate hypotheses

Comments: Accepted by ICLR 2026

Click to view abstract

Abstract:Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent's effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
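
The idea of selecting top-K branches from relative rather than absolute judgments can be illustrated with a small round-robin sketch. This is not the SciNav implementation: the `prefer` callback is a hypothetical stand-in for whatever pairwise judge is used (an LLM in the paper's setting), and counting wins over all pairs is only one simple way to turn pairwise outcomes into a ranking.

```python
from itertools import combinations


def top_k_by_pairwise(candidates, prefer, k):
    """Rank candidates by number of pairwise wins and keep the best k.

    prefer(a, b) is a hypothetical judge returning True when a is
    preferred over b. Ties in win count are broken by original order.
    """
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    ranked = sorted(wins, key=lambda i: (-wins[i], i))
    return [candidates[i] for i in ranked[:k]]


# Toy judge for illustration only: prefer the longer string.
solutions = ["a", "bbb", "cc", "dddd"]
kept = top_k_by_pairwise(solutions, lambda x, y: len(x) >= len(y), k=2)
print(kept)
```

In a tree search, the kept candidates would be the branches expanded further while the rest are pruned; a full round-robin costs a quadratic number of comparisons, which matters under a constrained search budget.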

131. 【2603.20255】Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education

Link: https://arxiv.org/abs/2603.20255

Authors: Abdul Aziz Snoubara, Baraa Al_Maradni, Haya Al_Naal, Malek Al_Madrmani, Roaa Jdini, Seedra Zarzour, Khloud Al Jallad

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: gained significant interest, Speech-based AI educational, educational applications, applications have gained, gained significant

Comments:

Click to view abstract

Abstract:Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children's speech research remains limited due to the lack of publicly available datasets, especially for low-resource languages such as Arabic. This paper presents Abjad-Kids, an Arabic speech dataset designed for kindergarten and primary education, focusing on fundamental learning of alphabets, numbers, and colors. The dataset consists of 46,397 audio samples collected from children aged 3 to 12 years, covering 141 classes. All samples were recorded under controlled specifications to ensure consistency in duration, sampling rate, and format. To address high intra-class similarity among Arabic phonemes and the limited samples per class, we propose a hierarchical audio classification based on CNN-LSTM architectures. Our proposed methodology decomposes alphabet recognition into a two-stage process: an initial grouping classification model followed by specialized classifiers for each group. Two grouping strategies were evaluated: static linguistic-based grouping and dynamic clustering-based grouping. Experimental results demonstrate that static linguistic-based grouping achieves superior performance. Comparisons between traditional machine learning and deep learning approaches highlight the effectiveness of CNN-LSTM models combined with data augmentation. Despite achieving promising results, most of our experiments indicate a challenge with overfitting, which is likely due to the limited number of samples, even after data augmentation and model regularization. Thus, future work may focus on collecting additional data to address this issue. Abjad-Kids will be publicly available. We hope that Abjad-Kids enriches children's representation in speech datasets and serves as a good resource for future research in Arabic speech classification for kids.

132. 【2603.20252】FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

Link: https://arxiv.org/abs/2603.20252

Authors: Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali

Categories: Computation and Language (cs.CL); Computational Finance (q-fin.CP)

Keywords: critical engineering challenge, organizations increasingly integrate, increasingly integrate AI-powered, integrate AI-powered question-answering, AI-powered question-answering systems

Comments:

Click to view abstract

Abstract:As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
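
As a reminder of what the MCC figures measure, here is a minimal sketch of the Matthews Correlation Coefficient computed from confusion-matrix counts, together with a relative-drop helper. The counts below are made up for illustration, and treating the reported "44-84 percent" as a relative drop from the clean-condition score is an assumption; the abstract does not say whether the drop is relative or absolute.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts.

    Returns 0.0 when any marginal is empty (the usual convention for a
    zero denominator).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_drop(clean, noisy):
    """Fractional degradation of a score relative to its clean value."""
    return (clean - noisy) / clean


# Hypothetical counts for one detector under clean and noisy conditions.
clean_score = mcc(tp=80, tn=80, fp=20, fn=20)
noisy_score = mcc(tp=65, tn=65, fp=35, fn=35)
print(round(clean_score, 2), round(noisy_score, 2),
      round(relative_drop(clean_score, noisy_score), 2))
```

Unlike F1, MCC uses all four cells of the confusion matrix, so it stays informative when grounded and hallucinated examples are imbalanced.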

133. 【2603.20246】Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding

Link: https://arxiv.org/abs/2603.20246

Authors: Michal Olak, Tommaso Boccato, Matteo Ferrante

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)

Keywords: computer interfaces require, translate intracortical activity, interfaces require decoders, computer interfaces, interfaces require

Comments:

Click to view abstract

Abstract:Speech brain-computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream language models, it remains unclear what contextual sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability. We evaluated a multitask Transformer-based sequence-to-sequence model for attempted speech decoding from area 6v intracortical recordings. The model jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features. To address day-to-day nonstationarity, we introduced the Neural Hammer Scalpel (NHS) calibration module, which combines global alignment with feature-wise modulation. We further analyzed held-out-day generalization and attention patterns in the encoder and decoders. On the Willett et al. dataset, the proposed model achieved a state-of-the-art phoneme error rate of 14.3%. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding relative to linear or no day-specific transform, while held-out-day experiments showed increasing degradation on unseen days with temporal distance. Attention visualizations revealed recurring temporal chunking in encoder representations and distinct use of these segments by phoneme and word decoders. These results indicate that contextual sequence-to-sequence modeling can improve the fidelity of neural-to-phoneme readout from intracortical speech signals and suggest that attention-based analyses can generate useful hypotheses about how neural speech evidence is segmented and accumulated over time.
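
Phoneme error rate and word error rate are both conventionally defined as the edit distance between the decoded and reference sequences divided by the reference length. The sketch below implements that standard definition; it is a generic illustration with made-up sequences, not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by reference length (WER or PER)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Word-level example: one substitution and one deletion over five words.
wer = error_rate("the cat sat on mat".split(), "the cat sit on".split())
# Phoneme-level example: one deleted phoneme out of four.
per = error_rate(["HH", "AH", "L", "OW"], ["HH", "AH", "L"])
print(wer, per)
```

The same function serves both metrics; only the tokenisation (words versus phonemes) changes.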

134. 【2603.20231】Email in the Era of LLMs

Link: https://arxiv.org/abs/2603.20231

Authors: Dang Nguyen, Harvey Yiyun Fu, Peter West, Chenhao Tan, Ari Holtzman

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: nuanced social goals, increasingly involves large, involves large language, communication increasingly involves, large language models

Comments: 47 pages (including appendix), 6 figures, 2 tables main body

Click to view abstract

Abstract:Email communication increasingly involves large language models (LLMs), but we lack intuition on how they will read, write, and optimize for nuanced social goals. We introduce HR Simulator, a game where communication is the core mechanic: players play as a Human Resources officer and write emails to solve socially challenging workplace scenarios. An analysis of 600+ human and LLM emails with LLMs-as-judge reveals evidence for larger LLMs becoming more homogenous in their email quality judgments. Under LLM judges, humans underperform LLMs (e.g., 23.5% vs. 48-54% success rate), but a human+LLM approach can outperform LLM-only (e.g., from 40% to nearly 100% in one scenario). In cases where models' email preferences disagree, emergent tact is a plausible explanation: weaker models prefer less tactful strategies while stronger models prefer more tactful ones. Regarding tone, LLM emails are more formal and empathetic while human emails are more varied. LLM rewrites make human emails more formal and empathetic, but models still struggle to imitate human emails in the low empathy, low formality quadrant, which highlights a limitation of current post-training approaches. Our results demonstrate the efficacy of communication games as instruments to measure communication in the era of LLMs, and posit human-LLM co-writing as an effective form of communication in that future.

135. 【2603.20224】Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference

Link: https://arxiv.org/abs/2603.20224

Authors: Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Small Language Models, demonstrate exceptional performance, Language Models

Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose energy efficiency metrics, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy for sustainable AI deployment.

136. 【2603.20222】Linguistic Signatures for Enhanced Emotion Detection

Link: https://arxiv.org/abs/2603.20222

Authors: Florian Lecourt (LIRMM | ADVANSE), Madalina Croitoru (LIRMM), Konstantin Todorov (WEB3)

Categories: Computation and Language (cs.CL)

Keywords: recent progress driven, problem in NLP, transformer-based models trained, central problem, recent progress

Comments:

Click to view abstract

Abstract:Emotion detection is a central problem in NLP, with recent progress driven by transformer-based models trained on established datasets. However, little is known about the linguistic regularities that characterize how emotions are expressed across different corpora and labels. This study examines whether linguistic features can serve as reliable interpretable signals for emotion recognition in text. We extract emotion-specific linguistic signatures from 13 English datasets and evaluate how incorporating these features into transformer models impacts performance. Our RoBERTa-based models enriched with high level linguistic features achieve consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark, showing that explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories.

137. 【2603.20219】Thinking into the Future: Latent Lookahead Training for Transformers

Link: https://arxiv.org/abs/2603.20219

Authors: Lorenzo Noci, Gregor Bachmann, Seyed-Mohsen Moosavi-Dezfooli, Moin Nabi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: language models trained, next-token prediction generate, prediction generate text, Autoregressive language models, trained with next-token

Comments:

Click to view abstract

Abstract:Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $\tau$ steps, investing more compute on predicting that token. This produces $\tau$ latent predictions that are supervised against the next $\tau$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.

138. 【2603.20218】An experimental study of KV cache reuse strategies in chunk-level caching systems

Link: https://arxiv.org/abs/2603.20218

Authors: Samuel Cestola, Tianxiang Xia, Zheng Weiyan, Zheng Pengfei, Diego Didona

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Retrieval-augmented generation improves, large language models', Retrieval-augmented generation, adding relevant retrieved, relevant retrieved text

Comments:

Click to view abstract

Abstract:Retrieval-augmented generation improves large language models' accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that limit their accuracy or their applicability. We back this conclusion with an extensive CLC system experimental evaluation. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.

139. 【2603.20217】Expected Reward Prediction, with Applications to Model Routing

Link: https://arxiv.org/abs/2603.20217

Authors: Kenan Hasanaliyev, Silas Alberti, Jenny Hamer, Dheeraj Rajagopal, Kevin Robinson, Jasper Snoek, Victor Veitch, Alexander Nicholas D'Amour

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reward, expected reward, Reward models, standard tool, models

Comments: ICML 2025 Workshop on Models of Human Feedback for AI Alignment

Click to view abstract

Abstract:Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model's suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction-based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt's category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
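
The routing step described above can be pictured as an argmax over per-model predicted rewards with a cost penalty. The sketch below is only an illustration of that idea: the linear `reward - cost_weight * cost` objective, the model names, and the numbers are all assumptions, and the abstract does not state how the paper trades reward against cost.

```python
def route(predicted_rewards, costs, cost_weight=0.0):
    """Pick the model with the best cost-penalised predicted reward.

    predicted_rewards and costs map model name -> float. The linear
    penalty is a hypothetical choice made for this sketch.
    """
    return max(
        predicted_rewards,
        key=lambda m: predicted_rewards[m] - cost_weight * costs[m],
    )


# Hypothetical expected-reward predictions for one prompt.
rewards = {"small-model": 0.70, "large-model": 0.75}
costs = {"small-model": 1.0, "large-model": 8.0}

print(route(rewards, costs))                    # reward only
print(route(rewards, costs, cost_weight=0.01))  # cost-aware
```

Because each model only needs its own predicted reward, adding a model to the pool means adding one more entry to the two dictionaries, which matches the extensibility point made in the abstract.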

140. 【2603.20216】Locally Coherent Parallel Decoding in Diffusion Language Models

Link: https://arxiv.org/abs/2603.20216

Authors: Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: alternative to autoregressive, promising alternative, offering sub-linear generation, offering sub-linear, Achieving sub-linear latency

Comments:

Click to view abstract

Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

141. 【2603.20215】Multi-Agent Debate with Memory Masking

Link: https://arxiv.org/abs/2603.20215

Authors: Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan, Bo Han

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, recently demonstrated impressive, Large language, LLM reasoning frameworks, demonstrated impressive capabilities

Comments: ICLR 2026

Click to view abstract

Abstract:Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, *multi-agent debate* (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, *multi-agent debate with memory masking* (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.

142. 【2603.20213】AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization

Link: https://arxiv.org/abs/2603.20213

Authors: Jiaqi Yuan, Jialu Wang, Zihan Wang, Qingyun Sun, Ruijie Wang, Jianxin Li

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: Large Language Model, Large Language, traditional ranking-based retrieval, transforming optimization goals, retrieval to Large

Comments:

Click to view abstract

Abstract:Generative search engines represent a transition from traditional ranking-based retrieval to Large Language Model (LLM)-based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black-box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single-prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self-evolving agentic framework formulating optimization as a content-conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black-box engines. Unlike fixed-strategy methods, AgenticGEO employs a MAP-Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co-Evolving Critic, a lightweight surrogate that approximates engine feedback for content-specific strategy selection and refinement, efficiently guiding both evolutionary search and inference-time planning. Through extensive in-domain and cross-domain experiments on two representative engines, AgenticGEO achieves state-of-the-art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: this https URL.

143. 【2603.20212】Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Link: https://arxiv.org/abs/2603.20212

Authors: Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Human Feedback, aligning Large Language, Large Language Models, aligning Large, Generative Reward Models

Notes:

Abstract:Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
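To make the fast/slow split in the abstract above concrete, here is a minimal sketch of confidence-gated routing. The abstract does not spell out the dual-confidence mechanism, so the single `threshold`, the two-way "yes"/"no" first-token readout, and the names `fast_score`, `route`, and `slow_judge` are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import Callable, Tuple


def fast_score(logit_yes: float, logit_no: float) -> float:
    """Scalar reward from the first token: P("yes") under a two-way softmax (assumed readout)."""
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def route(
    logit_yes: float,
    logit_no: float,
    slow_judge: Callable[[], float],
    threshold: float = 0.8,  # hypothetical confidence cut-off
) -> Tuple[float, str]:
    """Return (reward, path). Use the cheap scalar score when it is confident,
    otherwise fall back to the expensive CoT-style judge."""
    p = fast_score(logit_yes, logit_no)
    confidence = max(p, 1.0 - p)
    if confidence >= threshold:
        return p, "fast"
    return slow_judge(), "slow"


if __name__ == "__main__":
    print(route(4.0, 0.0, lambda: 0.5))  # confident first token, stays on the fast path
    print(route(0.2, 0.0, lambda: 0.9))  # uncertain first token, defers to the slow judge
```

The token savings reported in the abstract would come from how often the fast path is taken; in this sketch that is governed entirely by the assumed threshold.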

144. 【2603.20210】CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language

Link: https://arxiv.org/abs/2603.20210

Authors: Roy Uziel, Omer Belhasin, Itay Levi, Akhiad Bercovich, Ran El-Yaniv, Ran Zilberstein, Michael Elad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Masked Diffusion Models, efficient non-causal alternative, semantic incoherence due, Masked Diffusion, Diffusion Models

Notes:

Abstract:Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which decoding is obtained by an MDM algorithm. Relying on the same framework, we introduce two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (ConWithinDisc), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than 10x faster sampling speeds in an unconditional setting.

145. 【2603.20209】Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Link: https://arxiv.org/abs/2603.20209

Authors: Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, process multimodal data, Large Language, Multimodal Large

Notes: Accepted at ICLR 2026

Abstract:Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs' adaptability and developmental potential, mirroring the stages of children's cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: this https URL.

146. 【2603.20208】RedacBench: Can AI Erase Your Secrets?

Link: https://arxiv.org/abs/2603.20208

Authors: Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: Modern language models, Modern language, readily extract sensitive, readily extract, Modern

Notes:

Abstract:Modern language models can readily extract sensitive information from unstructured text, making redaction -- the selective removal of such information -- critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model's ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security -- the removal of sensitive propositions -- and utility -- the preservation of non-sensitive propositions. Experiments across multiple redaction strategies and state-of-the-art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web-based playground for dataset customization and evaluation. Available at this https URL.

147. 【2603.20206】Enhancing Safety of Large Language Models via Embedding Space Separation

Link: https://arxiv.org/abs/2603.20206

Authors: Xu Zhao, Xiting Wang, Weiran Shen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, harmful prompts remains, Large language, achieved impressive capabilities, Embedding Space Separation

Notes:

Abstract:Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability, a property that has been exploited to construct attacks by perturbing the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of the model's general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs. We evaluate our method on several open-source LLMs using standard safety benchmarks. Extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.
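The two-term objective described in the abstract above (push harmful and safe representations apart, keep logits on harmless inputs close to the base model) can be sketched roughly as follows. The abstract gives neither the distance measure nor the KL direction, so the centroid Euclidean distance, the KL(base || tuned) direction, the weight `lam`, and every function name here are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def separation_loss(
    harmful: Sequence[Sequence[float]],   # embeddings of harmful prompts
    safe: Sequence[Sequence[float]],      # embeddings of safe prompts
    base_logits: Sequence[float],         # base model logits on a harmless input
    tuned_logits: Sequence[float],        # fine-tuned model logits on the same input
    lam: float = 1.0,                     # hypothetical regularization weight
) -> float:
    """Lower is better: a larger embedding gap reduces the loss,
    drifting away from the base model's logits increases it."""
    gap = math.dist(centroid(harmful), centroid(safe))
    kl = kl_divergence(softmax(base_logits), softmax(tuned_logits))
    return -gap + lam * kl
```

With identical base and tuned logits the KL term vanishes and the loss reduces to the negative centroid gap, which is the behaviour the regularizer is meant to anchor.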

148. 【2603.21875】Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Link: https://arxiv.org/abs/2603.21875

Authors: Xi Xuan, Wenxin Zhang, Zhiyu Li, Jennifer Williams, Ville Hautamäki, Tomi H. Kinnunen

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: synthetic speech utterances, speech utterances originate, Speech deepfake source, verification systems aims, Speech deepfake

Notes: Submitted to Interspeech 2026; The code, evaluation protocols and demo website are available at [this https URL](https://github.com/xxuan-acoustics/RiemannSD-Net)

Abstract:Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at this https URL.

149. 【2603.21576】PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Link: https://arxiv.org/abs/2603.21576

Authors: Hyoseok Park, Yeonsang Park

Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Long-context LLM inference, Long-context LLM, LLM inference, memory bandwidth cost, inference is bottlenecked

Notes: 28 pages, 27 figures, 15 tables, including supplementary material. Code available at [this https URL](https://github.com/hyoseokp/PRISM)

Abstract:Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as n increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n = 4K).
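The block-selection step in the abstract above can be illustrated in software: score every KV block, keep only the top k, and note that coarse quantization is enough because only rank order matters. This is a plain-Python sketch, not a model of the photonic hardware. The 5-bit default, the score range of [-1, 1], and in particular the 128-token block size are assumptions; the block size is chosen only because it makes the arithmetic line up with the reported 16x figure at 64K context and k=32, and the abstract does not state it.

```python
from typing import List, Sequence


def quantize(score: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a similarity score in [lo, hi] to an integer code with 2**bits levels."""
    levels = (1 << bits) - 1
    clipped = min(max(score, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def select_top_k(scores: Sequence[float], k: int, bits: int = 5) -> List[int]:
    """Indices of the k best blocks, ranked on the quantized scores only."""
    coarse = [quantize(s, bits) for s in scores]
    order = sorted(range(len(scores)), key=lambda i: (-coarse[i], i))
    return sorted(order[:k])


def traffic_reduction(context_tokens: int, block_tokens: int, k: int) -> float:
    """Ratio of blocks scanned without selection to blocks fetched with it."""
    n_blocks = context_tokens // block_tokens
    return n_blocks / k


if __name__ == "__main__":
    print(select_top_k([0.1, 0.9, -0.3, 0.5], k=2))
    # 65536 tokens / 128 tokens per block (assumed) = 512 blocks; 512 / 32 = 16
    print(traffic_reduction(65536, 128, 32))
```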

150. 【2603.21342】Generalized Discrete Diffusion from Snapshots

Link: https://arxiv.org/abs/2603.21342

Authors: Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba

Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: introduce Generalized Discrete, Generalized Discrete Diffusion, discrete state spaces, large discrete state, introduce Generalized

Notes: 37 pages, 6 figures, 13 tables

Abstract:We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: this https URL.

151. 【2603.21073】SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

Link: https://arxiv.org/abs/2603.21073

Authors: Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Composing coherent long-form, significant challenge due, modeling long-range dependencies, Composing coherent, lengthy audio representations

Notes: Under Review

Abstract:Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at this https URL.

152. 【2603.20321】GIP-RAG: An Evidence-Grounded Retrieval-Augmented Framework for Interpretable Gene Interaction and Pathway Impact Analysis

Link: https://arxiv.org/abs/2603.20321

Authors: Fujian Jia, Jiwen Gu, Cheng Lu, Dezhi Zhao, Mengjiang Huang, Yuanzhi Lu, Xin Liu, Kang Liu

Categories: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: advancing precision medicine, Understanding mechanistic relationships, elucidating disease mechanisms, Understanding mechanistic, precision medicine

Notes: 29 pages

Abstract:Understanding mechanistic relationships among genes and their impacts on biological pathways is essential for elucidating disease mechanisms and advancing precision medicine. Despite the availability of extensive molecular interaction and pathway data in public databases, integrating heterogeneous knowledge sources and enabling interpretable multi-step reasoning across biological networks remain challenging. We present GIP-RAG (Gene Interaction Prediction through Retrieval-Augmented Generation), a computational framework that combines biomedical knowledge graphs with large language models (LLMs) to infer and interpret gene interactions. The framework constructs a unified gene interaction knowledge graph by integrating curated data from KEGG, WikiPathways, SIGNOR, Pathway Commons, and PubChem. Given user-specified genes, a query-driven module retrieves relevant subgraphs, which are incorporated into structured prompts to guide LLM-based stepwise reasoning. This enables identification of direct and indirect regulatory relationships and generation of mechanistic explanations supported by biological evidence. Beyond pairwise interactions, GIP-RAG includes a pathway-level functional impact module that simulates propagation of gene perturbations through signaling networks and evaluates potential pathway state changes. Evaluation across diverse biological scenarios demonstrates that the framework generates consistent, interpretable, and evidence-supported insights into gene regulatory mechanisms. Overall, GIP-RAG provides a general and interpretable approach for integrating knowledge graphs with retrieval-augmented LLMs to support mechanistic reasoning in complex molecular systems.


Information Retrieval

1. 【2603.22231】One Model, Two Markets: Bid-Aware Generative Recommendation

Link: https://arxiv.org/abs/2603.22231

Authors: Yanchen Jiang, Zhe Feng, Christopher P. Mah, Aranyak Mehta, Di Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Keywords: Generative Recommender Systems, Recommender Systems, widely adopted competitive, adopted competitive paradigm, Generative Recommender

Notes:

Abstract:Generative Recommender Systems using semantic IDs, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad's likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.
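The allocation-monotonicity property mentioned in the abstract above is easy to see on a toy decoder. The abstract does not give the actual decoding rule, so the additive log-bid bonus below, the weight `lam`, and the function name `bid_aware_probs` are assumptions chosen only because they make the property visible: raising one candidate's bid raises its adjusted logit, and a softmax probability never decreases when its own logit goes up.

```python
import math
from typing import List, Sequence


def bid_aware_probs(
    logits: Sequence[float],
    bids: Sequence[float],
    lam: float = 1.0,  # hypothetical strength of the bid signal
) -> List[float]:
    """Softmax over model logits shifted by lam * log(bid).

    Bids must be positive; a bid of 1.0 leaves a candidate's logit unchanged,
    which is how organic (non-ad) items are represented in this sketch.
    """
    adjusted = [z + lam * math.log(b) for z, b in zip(logits, bids)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]
    before = bid_aware_probs(logits, [1.0, 1.0, 1.0])
    after = bid_aware_probs(logits, [1.0, 2.0, 1.0])  # candidate 1 doubles its bid
    print(before[1], after[1])
```

Because the bids enter only at inference time, changing them requires no retraining in this toy setup, which mirrors the claim in the abstract.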

2. 【2603.22073】PreferRec: Learning and Transferring Pareto Preferences for Multi-objective Re-ranking

Link: https://arxiv.org/abs/2603.22073

Authors: Wei Zhou, Wuyang Li, Junkai Ji, Xueliang Li, Wenjing Hong, Zexuan Zhu, Xing Tang, Xiuqiang He

Categories: Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)

Keywords: multi-stage recommender systems, modern multi-stage recommender, recommender systems, modern multi-stage, multi-stage recommender

Notes:

Abstract:Multi-objective re-ranking has become a critical component of modern multi-stage recommender systems, as it is tasked with balancing multiple conflicting objectives such as accuracy, diversity, and fairness. Existing multi-objective re-ranking methods typically optimize aggregate objectives at the item level using static or handcrafted preference weights. This design overlooks that users inherently exhibit Pareto-optimal preferences at the intent level, reflecting personalized trade-offs among objectives rather than fixed weight combinations. Moreover, most approaches treat the re-ranking task for each user as an isolated problem, and repeatedly learn the preferences from scratch. Such a paradigm not only incurs high computational cost, but also ignores the fact that users often share similar preference trade-off structures across objectives. Inspired by the existence of homogeneous multi-objective optimization spaces where Pareto-optimal patterns are transferable, we propose PreferRec, a novel framework that explicitly models and transfers Pareto preferences across users. Specifically, PreferRec is built upon three tightly coupled components: Preference-Aware Pareto Learning aims to capture users' intrinsic trade-offs among multiple conflicting objectives at the intent level. By learning Pareto preference representations from re-ranking populations, this component explicitly models how users prioritize different objectives under diverse contexts. Knowledge-Guided Transfer facilitates efficient cross-user knowledge transfer by distilling shared optimization patterns across homogeneous optimization spaces. The transferred knowledge is then used to guide solution selection and personalized re-ranking, biasing the optimization process toward high-quality regions of the Pareto front while preserving user-specific preference characteristics.

3. 【2603.22008】On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Link: https://arxiv.org/abs/2603.22008

Authors: Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: software engineering systems, modern LLM-based software, LLM-based software engineering, engineering systems, large codebases

Notes: 15 pages, 5 figures, 12 tables

Abstract:Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

4. 【2603.21886】ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

Link: https://arxiv.org/abs/2603.21886

Authors: Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie, Zijun Long

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, resulting in increased, increased effectiveness, textual information, Recent

Notes:

Abstract:Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.

5. 【2603.21871】GoogleTrendArchive: A Year-Long Archive of Real-Time Web Search Trends Worldwide

Link: https://arxiv.org/abs/2603.21871

Authors: Aleksandra Urman, Anikó Hannák, Joachim Baumann

Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Trending, November, January, Google Trending, Google

Notes: Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)

Abstract:GoogleTrendArchive is a comprehensive archive of Google Trending Now data spanning over one year (from November 28, 2024 to January 3, 2026) across 125 countries and 1,358 locations. Unlike Google Trends, which requires specifying search terms in advance, Trending Now captures search queries experiencing real-time surges, offering a way to inductively discover trending patterns across regions for studying collective attention dynamics. However, Google does not provide historical access to this data beyond seven days. Our dataset addresses this gap by presenting an archive of Trending Now data. The dataset contains over 7.6 million trend episodes. Each record includes the trend identifier, search volume bucket, precise timestamps, duration, geographic location, and related query clusters. Among other uses, this dataset enables systematic studies of information diffusion patterns, cross-cultural attention dynamics, crisis responses, and the temporal evolution of collective information-seeking at a global scale. The comprehensive geographic coverage facilitates fine-grained cross-country or cross-regional comparative analyses.

6. 【2603.21613】AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents

Link: https://arxiv.org/abs/2603.21613

Authors: Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models offer, Large Language, Language Models, Recommender agents built

Notes:

Abstract:Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes the convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.

7. 【2603.21582】Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track

Link: https://arxiv.org/abs/2603.21582

Authors: Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts

Categories: Information Retrieval (cs.IR)

Keywords: made significant progress, Recent advances, large language models, multiple biomedical tasks, clinical note summarization

Notes:

Abstract:Recent advances in large language models (LLMs) have made significant progress across multiple biomedical tasks, including biomedical question answering, lay-language summarization of the biomedical literature, and clinical note summarization. These models have demonstrated strong capabilities in processing and synthesizing complex biomedical information and in generating fluent, human-like responses. Despite these advancements, hallucinations or confabulations remain key challenges when using LLMs in biomedical and other high-stakes domains. Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research. Studies on the evaluation of the LLMs' abilities to ground generated statements in verifiable sources have shown that models perform significantly

8. 【2603.21564】Toward a Theory of Hierarchical Memory for Language Agents

Link: https://arxiv.org/abs/2603.21564

Authors: Yashar Talebirad, Ali Parsaee, Csongor Y. Szepesvari, Amirhossein Nadiri, Osmar Zaiane

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Social and Information Networks (cs.SI)

Keywords: address context-length limitations, adding hierarchical memory, build multi-level representatives, agentic systems address, systems address context-length

Notes:

Abstract:Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction ($\alpha$) maps raw data to atomic information units; coarsening ($C = (\pi, \rho)$) partitions units and assigns a representative to each group; and traversal ($\tau$) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function $\rho$ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.
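The three operators named in the abstract above (extraction, coarsening, traversal) compose naturally into a small pipeline. The sketch below is a toy instantiation under loud assumptions: sentence splitting stands in for extraction, fixed-size grouping with the first unit as representative stands in for coarsening, and word-overlap ranking under a word-count budget stands in for traversal. None of these concrete choices or function names come from the paper.

```python
from typing import List, Tuple


def extract(raw: str) -> List[str]:
    """alpha: map raw text to atomic units (here, sentences split on periods)."""
    return [s.strip() for s in raw.split(".") if s.strip()]


def coarsen(units: List[str], group_size: int = 2) -> Tuple[List[List[str]], List[str]]:
    """C = (pi, rho): partition units into groups and pick one representative each.

    pi is fixed-size chunking; rho is simply the first unit of the group, a
    representative that is not self-sufficient, so traversal must descend.
    """
    groups = [units[i:i + group_size] for i in range(0, len(units), group_size)]
    reps = [group[0] for group in groups]
    return groups, reps


def traverse(query: str, groups: List[List[str]], reps: List[str], budget: int) -> List[str]:
    """tau: rank groups by word overlap between query and representative,
    then emit their units until the word budget is exhausted."""
    q_words = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(q_words & set(text.lower().split()))

    ranked = sorted(range(len(groups)), key=lambda i: (-overlap(reps[i]), i))
    chosen: List[str] = []
    used = 0
    for i in ranked:
        for unit in groups[i]:
            cost = len(unit.split())
            if used + cost > budget:
                return chosen
            chosen.append(unit)
            used += cost
    return chosen


if __name__ == "__main__":
    units = extract("Cats purr. Cats nap often. Dogs bark. Dogs fetch sticks.")
    groups, reps = coarsen(units)
    print(traverse("what do dogs do", groups, reps, budget=5))
```

Swapping the representative for a summary that can answer queries on its own would let traversal stop at the group level, which is one way to read the coarsening-traversal coupling the abstract describes.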

9. 【2603.21481】TagLLM: A Fine-Grained Tag Generation Approach for Note Recommendation

Link: https://arxiv.org/abs/2603.21481

Authors: Zhijian Chen, Likai Wang, Lei Chen, Yaguang Dou, Jialiang Shi, Tian Qi, Dongdong Hao, Mengying Lu, Cheng Ye, Chao Wei

Subjects: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, E-commerce community recommendation, shown promising potential, potential in E-commerce

Abstract:Large Language Models (LLMs) have shown promising potential in E-commerce community recommendation. While LLMs and Multimodal LLMs (MLLMs) are widely used to encode notes into implicit embeddings, leveraging their generative capabilities to represent notes with interpretable tags remains unexplored. In the field of tag generation, traditional close-ended methods heavily rely on the design of tag pools, while existing open-ended methods applied directly to note recommendations face two limitations: (1) MLLMs lack guidance during generation, resulting in redundant tags that fail to capture user interests; (2) the generated tags are often coarse and lack fine-grained representation of notes, interfering with downstream recommendations. To address these limitations, we propose TagLLM, a fine-grained tag generation method for note recommendation. TagLLM captures user interests across note categories through a User Interest Handbook and constructs fine-grained tag data using multimodal CoT Extraction. A Tag Knowledge Distillation method is developed to equip small models with competitive generation capabilities, enhancing inference efficiency. In an online A/B test, TagLLM increases average view duration per user by 0.31%, average interactions per user by 0.96%, and page view click-through rate in the cold-start scenario by 32.37%, demonstrating its effectiveness.

10. 【2603.21460】When Documents Disagree: Measuring Institutional Variation in Transplant Guidance with Retrieval-Augmented Language Models

Link: https://arxiv.org/abs/2603.21460

Authors: Yubo Li, Ramayya Krishnan, Rema Padman

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: solid-organ transplantation vary, transplantation vary substantially, systematic method exists, solid-organ transplantation, transplantation vary

Abstract:Patient education materials for solid-organ transplantation vary substantially across U.S. centers, yet no systematic method exists to quantify this heterogeneity at scale. We introduce a framework that grounds the same patient questions in different centers' handbooks using retrieval-augmented language models and compares the resulting answers using a five-label consistency taxonomy. Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center. We find that 20.8% of non-absent pairwise comparisons exhibit clinically meaningful divergence, concentrated in condition monitoring and lifestyle topics. Coverage gaps are even more prominent: 96.2% of question-handbook pairs miss relevant content, with reproductive health at 95.1% absence. Center-level divergence profiles are stable and interpretable, where heterogeneity reflects systematic institutional differences, likely due to patient diversity. These findings expose an information gap in transplant patient education materials, with document-grounded medical question answering highlighting opportunities for content improvement.

11. 【2603.21437】Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

Link: https://arxiv.org/abs/2603.21437

Authors: Hang Gao, Dimitris N. Metaxas

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling efficient similarity, efficient similarity search, inducing well-known geometric, well-known geometric pathologies, Transformer-based embedding models

Abstract:Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
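The smoothing effect described in this abstract can be illustrated with a toy geometric example. The sketch below is not the paper's semantic-shift measure: it assumes 2-D unit vectors as stand-ins for sentence embeddings and plain mean pooling, and only shows that the pooled vector moves away from individual sentences as their directions spread out.

```python
import math


def unit(angle_deg: float) -> tuple:
    """A 2-D unit vector standing in for a sentence embedding."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def mean_pool(vectors: list) -> tuple:
    """Average the sentence vectors component-wise."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def cosine(u: tuple, v: tuple) -> float:
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def min_similarity_to_pooled(angles_deg: list) -> float:
    """Smallest cosine between the pooled vector and any sentence vector."""
    sentences = [unit(a) for a in angles_deg]
    pooled = mean_pool(sentences)
    return min(cosine(pooled, s) for s in sentences)


tight = min_similarity_to_pooled([-5, 0, 5])     # semantically similar sentences
spread = min_similarity_to_pooled([-60, 0, 60])  # semantically diverse sentences
```

With the tight cluster the pooled vector stays within 5 degrees of every sentence, while with the wide cluster the outer sentences sit 60 degrees away from it, so `tight` is larger than `spread`.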

12. 【2603.21329】COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding

Link: https://arxiv.org/abs/2603.21329

Authors: Xiaozhe Li, Tianyi Lyu, Siyi Yang, Yizhao Yang, Yuxi Gong, Jinxuan Huang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, challenge for Large, requiring sophisticated reasoning, high-level cognitive challenge

Abstract:Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent - the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions - remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, real-world, live-updating benchmark specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-the-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions - depth, breadth, informativeness, and correctness - reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.

13. 【2603.21248】Graph Fusion Across Languages using Large Language Models

Link: https://arxiv.org/abs/2603.21248

Authors: Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: persistent challenge due, Large Language Models, Combining multiple knowledge, Combining multiple, linguistic boundaries

Abstract:Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
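The structural linearization step mentioned in this abstract can be sketched as follows. This is a hedged illustration only: the abstract gives the pattern "[head] [relation] [tail]", but the literal brackets, the one-triple-per-line layout, the prompt wording, and the function names `linearize` and `build_prompt` are assumptions, not the authors' implementation.

```python
def linearize(triples):
    """Render (head, relation, tail) triples, one '[head] [relation] [tail]' line each."""
    return "\n".join(f"[{h}] [{r}] [{t}]" for h, r, t in triples)


def build_prompt(fused_graph, candidate_graph):
    """Assemble a hypothetical prompt asking an LLM to align two graphs."""
    return (
        "Fused graph so far:\n" + linearize(fused_graph) + "\n\n"
        "New candidate graph:\n" + linearize(candidate_graph) + "\n\n"
        "List entity pairs and relation pairs that refer to the same thing."
    )
```

Calling `build_prompt` once per incoming graph, and merging the model's answer into the fused graph before the next call, would mirror the sequential fusion of $G_{c}^{(t-1)}$ with $G_{t}$ that the abstract describes.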

14. 【2603.21243】LSA: A Long-Short-term Aspect Interest Transformer for Aspect-Based Recommendation

Link: https://arxiv.org/abs/2603.21243

Authors: Le Liu, Junrui Liu, Yunhan Gao, Ziheng Wang, Tong Li

Subjects: Information Retrieval (cs.IR)

Keywords: personalized recommender systems, fine-grained user preferences, recommendation methods extract, methods extract aspect, aspect terms

Comments: WISE2025

Abstract:Aspect-based recommendation methods extract aspect terms from reviews, such as price, to model fine-grained user preferences on items, making them a critical approach in personalized recommender systems. Existing methods utilize graphs to represent the relationships among users, items, and aspect terms, modeling user preferences based on graph neural networks. However, they overlook the dynamic nature of user interests - users may temporarily focus on aspects they previously paid little attention to - making it difficult to assign accurate weights to aspect terms for each user-item interaction. In this paper, we propose a long-short-term aspect interest Transformer (LSA) for aspect-based recommendation, which effectively captures the dynamic nature of user preferences by integrating both long-term and short-term aspect interests. Specifically, the short-term interests model the temporal changes in the importance of recently interacted aspect terms, while the long-term interests consider global behavioral patterns, including aspects that users have not interacted with recently. Finally, LSA combines long- and short-term interests to evaluate the importance of aspects within the union of user and item aspect neighbors, thereby accurately assigning aspect weights for each user-item interaction. Experiments conducted on four real-world datasets demonstrate that LSA improves MSE by 2.55% on average over the best baseline.

15. 【2603.21209】MI-DPG: Decomposable Parameter Generation Network Based on Mutual Information for Multi-Scenario Recommendation

Link: https://arxiv.org/abs/2603.21209

Authors: Wenzhuo Cheng, Ke Ding, Xin Dong, Yong He, Liang Zhang, Linjian Mo

Subjects: Information Retrieval (cs.IR)

Keywords: Conversion rate, model, advertising systems, play a vital, vital role

Comments: Accepted by CIKM 2023

Abstract:Conversion rate (CVR) prediction models play a vital role in recommendation and advertising systems. Recent research on multi-scenario recommendation shows that learning a unified model to serve multiple scenarios is effective for improving overall performance. However, it remains challenging to improve model prediction performance across scenarios at low model parameter cost, and current solutions struggle to model multi-scenario diversity robustly. In this paper, we propose MI-DPG for multi-scenario CVR prediction, which learns scenario-conditioned dynamic model parameters for each scenario in a more efficient and effective manner. Specifically, we introduce an auxiliary network to generate scenario-conditioned dynamic weighting matrices, which are obtained by combining decomposed scenario-specific and scenario-shared low-rank matrices with parameter efficiency. For each scenario, weighting the backbone model parameters by the weighting matrix helps to specialize the model parameters for different scenarios. It can not only modulate the complete parameter space of the backbone model but also improve the model effectiveness. Furthermore, we design a mutual information regularization to enhance the diversity of model parameters across different scenarios by maximizing the mutual information between the scenario-aware input and the scenario-conditioned dynamic weighting matrix. Experiments on three real-world datasets show that MI-DPG significantly outperforms previous multi-scenario recommendation models.

16. 【2603.21188】Ontology-Compliant Knowledge Graphs

Link: https://arxiv.org/abs/2603.21188

Authors: Zhangcheng Qiang

Subjects: Information Retrieval (cs.IR)

Keywords: constructing knowledge graphs, Ontologies can act, offering explainability, knowledge graphs, schema for constructing

Comments: 12 pages

Abstract:Ontologies can act as a schema for constructing knowledge graphs (KGs), offering explainability, interoperability, and reusability. We explore ontology-compliant KGs, aiming to build both internal and external ontology compliance. We discuss key tasks in ontology compliance and introduce our novel term-matching algorithms. We also propose a pattern-based compliance approach and novel compliance metrics. The building sector is a case study to test the validity of ontology-compliant KGs. We recommend using ontology-compliant KGs to pursue automatic matching, alignment, and harmonisation of heterogeneous KGs.

17. 【2603.21139】Ontology-driven personalized information retrieval for XML documents

Link: https://arxiv.org/abs/2603.21139

Authors: Ounnaci Iddir, Ahmed-ouamer Rachid, Tai Dinh

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: eXtensible Markup Language, semi-structured eXtensible Markup, Markup Language, eXtensible Markup, improving information retrieval

Abstract:This paper addresses the challenge of improving information retrieval from semi-structured eXtensible Markup Language (XML) documents. Traditional information retrieval systems (IRS) often overlook user-specific needs and return identical results for the same query, despite differences in users' knowledge, preferences, and objectives. We integrate external semantic resources, namely a domain ontology and user profiles, into the retrieval process. Documents, queries, and user profiles are represented as vectors of weighted concepts. The ontology applies a concept-weighting mechanism that emphasizes highly specific concepts, as lower-level nodes in the hierarchy provide more precise and targeted information. Relevance is assessed using semantic similarity measures that capture conceptual relationships beyond keyword matching, enabling personalized and fine-grained matching among user profiles, queries, and documents. Experimental results show that combining ontologies with user profiles improves retrieval effectiveness, achieving higher precision and recall than keyword-based approaches. Overall, the proposed framework enhances the relevance and adaptability of XML search results, supporting more user-centered retrieval.

18. 【2603.21024】Query, Decompose, Compress: Structured Query Expansion for Efficient Multi-Hop Retrieval

Link: https://arxiv.org/abs/2603.21024

Authors: JungMin Yun, YoungBin Kim

Subjects: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, increasingly employed, Large

Comments: Accepted to CIKM 2025

Abstract:Large Language Models (LLMs) have been increasingly employed for query expansion. However, their generative nature often undermines performance on complex multi-hop retrieval tasks by introducing irrelevant or noisy information. To address this challenge, we propose DeCoR (Decompose and Compress for Retrieval), a framework grounded in structured information refinement. Rather than generating additional content, DeCoR strategically restructures the query's underlying reasoning process and distills supporting evidence from retrieved documents. It consists of two core components tailored to the challenges of multi-hop retrieval: (1) Query Decomposition, which decomposes a complex query into explicit reasoning steps, and (2) Query-aware Document Compression, which synthesizes dispersed evidence from candidate documents into a concise summary relevant to the query. This structured design ensures that the final query representation remains both robust and comprehensive. Experimental results demonstrate that, despite utilizing a relatively small LLM, DeCoR outperforms strong baselines that rely on larger models. This finding underscores that, in complex retrieval scenarios, sophisticatedly leveraging the reasoning and summarization capabilities of LLMs offers a more efficient and effective solution than relying solely on their generative capability.

19. 【2603.21018】DSL-R1: From SQL to DSL for Training Retrieval Agents across Structured and Unstructured Data with Reinforcement Learning

Link: https://arxiv.org/abs/2603.21018

Authors: Yunhai Hu, Junwei Zhou, Yumo Cao, Yitao Long, Yiwei Xu, Qiyi Jiang, Weiyao Wang, Xiaoyu Cao, Zhen Sun, Yiran Zou, Nan Du

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Keywords: complex domains requires, domains requires bridging, unstructured content, complex domains, domains requires

Abstract:Effective retrieval in complex domains requires bridging the gap between structured metadata and unstructured content. Existing systems typically isolate these capabilities, relying on either symbolic filtering or vector similarity, failing to capture their interplay. In this work, we propose DSL-R1, a unified framework that synergizes logical reasoning with semantic matching via a novel Domain-Specific Language (DSL). By embedding vector primitives within SQL-style operators, our approach leverages the complementary strengths of symbolic precision and semantic coverage. We further introduce a reinforcement learning mechanism where rule-based execution feedback and retrieval quality rewards jointly optimize the DSL generation, balancing structural correctness and semantic alignment. Evaluations on a large-scale industrial email benchmark demonstrate that DSL-R1 achieves a +12.3% improvement in Hit@1/3, consistently outperforming decoupled baselines and establishing a robust paradigm for hybrid retrieval.

20. 【2603.21012】Consensus-Driven Group Recommendation on Sparse Explicit Feedback: A Collaborative Filtering and Choquet-Borda Aggregation Framework

Link: https://arxiv.org/abs/2603.21012

Authors: Anh Nguyen Van, Huy Ngo Hoang, Khoi Ngo Nguyen, Ngoc Pham Thi, Khanh Ngo Mai Bao, Quyen Nguyen Van

Subjects: Information Retrieval (cs.IR)

Keywords: Group Recommender Systems, Recommender Systems, potentially conflicting preferences, supporting collective decision-making, Group Recommender

Comments: Preprint. Under review for journal publication

Abstract:Group Recommender Systems (GRS) play an essential role in supporting collective decision-making among users with diverse and potentially conflicting preferences. However, achieving stable intra-group consensus becomes particularly challenging when only sparse userID-itemID-rating data are available and no demographic, contextual, or group-level information exists. This paper proposes a consensus-driven hybrid group recommendation framework that integrates neighborhood-based collaborative filtering with fuzzy aggregation to support agreement, fairness, and robustness under sparsity. A composite similarity measure, CBS (Combined Similarity), is derived from two enhanced similarity metrics introduced in prior work: a geometry-based measure that captures rating-pattern structure, and an uncertainty-aware measure that models belief, evidence, and disagreement in sparse co-rating contexts. This combination provides more stable estimation of missing ratings and supports consensus-oriented neighborhood construction. Candidate items are generated by merging per-user top-N predictions and further enriched using the Borda Count mechanism to mitigate skewed rating distributions and reinforce group-level agreement. Final group ratings are computed using the Choquet integral, which flexibly captures heterogeneous user influence while preserving fairness and supporting consensus formation. Experimental results on real-world datasets with different rating distributions show that the proposed method improves group-level consensus, satisfaction, and fairness, while maintaining a balanced level of novelty. Although the model does not rely on social information, its evaluation using trust-aware novelty measures indicates stable behavior in socially structured environments.
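The two aggregation tools named in this abstract have standard textbook definitions, sketched below in Python. This is not the paper's pipeline: how the authors build the fuzzy measure, combine Borda scores with predicted ratings, or handle ties is not stated in the abstract, so the `measure` argument and the example inputs are assumptions.

```python
def borda_scores(rankings):
    """Borda count: in a ranking of n items, position p (0 = best) earns n - 1 - p points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    return scores


def choquet_integral(ratings, measure):
    """Discrete Choquet integral of per-user ratings with respect to a fuzzy measure.

    ratings: dict mapping user -> non-negative rating for one item
    measure: function from a frozenset of users to [0, 1]; it should be
             monotone, with measure(empty set) = 0 and measure(all users) = 1
    """
    users = sorted(ratings, key=ratings.get)   # ascending by rating
    total, previous = 0.0, 0.0
    for i, user in enumerate(users):
        coalition = frozenset(users[i:])       # users whose rating is at least this high
        total += (ratings[user] - previous) * measure(coalition)
        previous = ratings[user]
    return total
```

When the measure is additive (the weight of a coalition is the sum of its members' weights), the Choquet integral reduces to an ordinary weighted mean; non-additive measures are what let it model interaction between users.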

21. 【2603.20990】ECI: Effective Contrastive Information to Evaluate Hard-Negatives

Link: https://arxiv.org/abs/2603.20990

Authors: Aarush Sinha, Rahul Seetharaman, Aman Bansal

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: improving retrieval accuracy, Hard negatives play, documents yet non-relevant, Hard negatives, Effective Contrastive Information

Abstract:Hard negatives play a critical role in training and fine-tuning dense retrieval models, as they are semantically similar to positive documents yet non-relevant, and correctly distinguishing them is essential for improving retrieval accuracy. However, identifying effective hard negatives typically requires extensive ablation studies involving repeated fine-tuning with different negative sampling strategies and hyperparameters, resulting in substantial computational cost. In this paper, we introduce ECI (Effective Contrastive Information), a theoretically grounded metric based on Information Theory and Information Retrieval principles that enables practitioners to assess the quality of hard negatives prior to model fine-tuning. ECI evaluates negatives by optimizing the trade-off between Information Capacity, the logarithmic bound on mutual information determined by set size, and Discriminative Efficiency, a harmonic balance of Signal Magnitude (Hardness) and Safety (Max-Margin). Unlike heuristic approaches, ECI strictly penalizes unsafe, false-positive negatives prevalent in generative methods. We evaluate ECI across hard-negative sets mined or generated using BM25, cross-encoders, and large language models. Our results demonstrate that ECI accurately predicts downstream retrieval performance, identifying that hybrid strategies (BM25+Cross-Encoder) offer the optimal balance of volume and reliability, significantly reducing the need for costly end-to-end ablation studies.

22. 【2603.20939】User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Link: https://arxiv.org/abs/2603.20939

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (stat.ML)

Keywords: repeatedly restate preferences, personal assistants, lack a persistent, repeatedly restate, Retrieval Scoring

Comments: 21 pages including appendices

Abstract:Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at this https URL.

23. 【2603.20882】RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation

Link: https://arxiv.org/abs/2603.20882

Authors: Kaustubh D. Dhole, Eugene Agichtein

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language models, Large language, output scalar scores, increasingly evaluated, output scalar

Abstract:Large language models (LLMs) are increasingly evaluated and sometimes trained using automated graders such as LLM-as-judges that output scalar scores or preferences. While convenient, these approaches are often opaque: a single score rarely explains why an answer is good or bad, which requirements were missed, or how a system should be improved. This lack of interpretability limits their usefulness for model development, dataset curation, and high-stakes deployment. Query-specific rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding and not feasible for deployment. While previous approaches have focused on generating intermediate rubrics for automated downstream evaluation, it is unclear if these rubrics are both interpretable and effective for human users. In this work, we investigate whether LLMs can generate useful, instance-specific rubrics as compared to human-authored rubrics, while also improving effectiveness for identifying good responses. Through our systematic study on two rubric benchmarks, and on multiple few-shot and post-training strategies, we find that off-the-shelf LLMs produce rubrics that are poorly aligned with human-authored ones. We introduce a simple strategy, RubricRAG, which retrieves domain knowledge via rubrics at inference time from related queries. We demonstrate that RubricRAG can generate more interpretable rubrics both for similarity to human-authored rubrics, and for improved downstream evaluation effectiveness. Our results highlight both the challenges and a promising approach of scalable, interpretable evaluation through automated rubric generation.

24. 【2603.20723】Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok

Link: https://arxiv.org/abs/2603.20723

Authors: Branislav Pecher, Adrian Bindas, Jan Jakubcik, Matus Tuna, Matus Tibensky, Simon Liska, Peter Sakalik, Andrej Suty, Matej Mosnar, Filip Hossner, Ivan Srba

Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Social media platforms, Social media, everyday life, integral part, part of everyday

Abstract:Social media platforms have become an integral part of everyday life, serving as a primary source of news and information for many users. These platforms increasingly rely on personalised recommendation systems that shape what users see and engage with. While these systems are optimised for engagement, concerns have emerged that they may also drive users toward more polarised perspectives, particularly in contested domains such as politics, climate change, vaccines, and conspiracy theories. In this paper, we present an algorithmic audit of personalisation drift on TikTok in these polarising topics. Using controlled accounts designed to simulate users with interests aligned with or opposed to different polarising topics, we systematically measure the extent to which TikTok steers content exposure toward specific topics and polarities over time. Specifically, we investigated: 1) a preference-aligned drift (showing a strong personalisation towards user interests); 2) a polarisation-topic drift (showing a strong neutralising effect for misinformation-themed topics, and a high preference for and reinforcement of interest in the US politics topic); and 3) a polarisation-stance drift (showing a preference for the opposing stance on the US politics topic and a general reinforcement of users' stance by recommending items aligned with their stance towards polarising topics). Overall, our findings provide evidence that recommendation trajectories differ markedly across topics, with some pathways amplifying polarised viewpoints more strongly than others, and offer insights for platform governance, transparency and user awareness.

25. 【2603.20704】NDT: Non-Differential Transformer and Its Application to Sentiment Analysis

Link: https://arxiv.org/abs/2603.20704

Authors: Soudeep Ghoshal, Himanshu Buckchash, Sarita Paudel, Rubén Ruiz-Torrubiano

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: understanding human sentiment, social media, understanding human, meaningfully with people, customer feedback

Comments: 10 pages, 16 figures. Submitted to IEEE Transactions on Computational Social Systems

Abstract:From customer feedback to social media, understanding human sentiment in text is central to how machines can interact meaningfully with people. However, despite notable progress, accurately capturing sentiment remains a challenging task, which continues to motivate further research in this area. To this end, we introduce Non-Differential Transformer (NDT). It is inspired by, but contrasts with, the state-of-the-art Differential Transformer (DT) model. While standard Transformers can struggle with irrelevant context, the DT model uses attention map subtraction, potentially for noise cancellation. We explore an alternative motivation, hypothesizing that benefits may arise from enabling different attention components to specialize on distinct concepts within the text, similar to multiplexing information channels or mixture models, rather than primarily canceling noise via subtraction. Guided by this concept-multiplexing (ConPlex) view, the specific architecture presented in this paper employs a purely additive strategy. It uses only positive weights, learned during training, to ensure constructive combination of these specialized attention perspectives. This design choice explores positive-only integration, though our broader framework also shows promise with less constrained linear combinations involving both positive and negative weights. Our model computes attention via this positively weighted sum of multiple distinct attention maps. This allows the model to constructively integrate diverse signals and potentially capture more complex contextual relationships. The proposed model achieves competitive performance on sentiment analysis when tested on multiple datasets. We conclude by presenting our results, challenges and future research agenda in this important area of research.
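The "positively weighted sum of multiple distinct attention maps" can be illustrated with a small pure-Python sketch. It is not the NDT architecture: the abstract only says the weights are positive and learned, so the softplus reparameterisation, the normalisation of weights to sum to 1, and the function names are assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention_map(scores):
    """Row-wise softmax over a matrix of raw query-key scores."""
    return [softmax(row) for row in scores]


def combine_positive(maps, raw_weights):
    """Combine attention maps using strictly positive mixing weights.

    Softplus keeps each weight positive; dividing by their sum (an
    assumption of this sketch) keeps every output row a distribution.
    """
    positive = [math.log1p(math.exp(w)) for w in raw_weights]  # softplus > 0
    norm = sum(positive)
    weights = [p / norm for p in positive]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) for j in range(cols)]
        for i in range(rows)
    ]
```

By contrast, the subtraction used by the Differential Transformer would give the second map a negative coefficient, which this additive combination rules out by construction.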

26. 【2603.20513】ReBOL: Retrieval via Bayesian Optimization with Batched LLM Relevance Observations and Query Reformulation

Link: https://arxiv.org/abs/2603.20513

Authors: Anton Korikov, Scott Sanner

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: enables contextual query-document, contextual query-document token, query-document token interactions, vector similarity, LLM-reranking is limited

Abstract:LLM-reranking is limited by the top-k documents retrieved by vector similarity, which neither enables contextual query-document token interactions nor captures multimodal relevance distributions. While LLM query reformulation attempts to improve recall by generating improved or additional queries, it is still followed by vector similarity retrieval. We thus propose to address these top-k retrieval stage failures by introducing ReBOL, which 1) uses LLM query reformulations to initialize a multimodal Bayesian Optimization (BO) posterior over document relevance, and 2) iteratively acquires document batches for LLM query-document relevance scoring followed by posterior updates to optimize relevance. After exploring query reformulation and document batch diversification techniques, we evaluate ReBOL against LLM reranker baselines on five BEIR datasets and using two LLMs (Gemini-2.5-Flash-Lite, GPT-5.2). ReBOL consistently achieves higher recall and competitive rankings, for example compared to the best LLM reranker on the Robust04 dataset with 46.5% vs. 35.0% recall@100 and 63.6% vs. 61.2% NDCG@10. We also show that ReBOL can achieve comparable latency to LLM rerankers.
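For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes recall@k and NDCG@k in their common textbook form. It assumes linear gains with a log2 rank discount; the paper's exact evaluation settings are not given in the abstract, and the document ids in any usage are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k, where `relevance` maps doc id -> graded relevance (missing = 0)."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Recall@100 only asks whether relevant documents made it into the candidate pool, while NDCG@10 rewards placing them near the top, which is why the abstract reports both.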

27. 【2603.20437】yProv4DV: Reproducible Data Visualization Scripts Out of the Box

Link: https://arxiv.org/abs/2603.20437

Authors: Gabriele Padovani, Sandro Fiore

Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)

Keywords: execution context, resulting figures, critical phase, frequently shared, complete combination

Comments: SoftwareX, 17 pages, 4 figures

Abstract:While results visualization is a critical phase in the communication of new academic results, plots are frequently shared without the complete combination of code, input data, execution context and outputs required to independently reproduce the resulting figures. Existing reproducibility solutions tend to focus on computational pipelines or workflow management systems, not covering script-based visualization practices commonly used by researchers and practitioners. Additionally, the minimalist nature of current Python data visualization libraries tends to speed up the creation of images, disincentivizing users from spending time integrating additional tools into these short scripts. This paper proposes yProv4DV, a lightweight library designed to enable reproducible data visualization scripts through the use of provenance information, minimizing the necessity for code modifications. Through a single call, users can track inputs, outputs and source code files, enabling saving and full reproducibility of their data visualization software. As a result, this library fills a gap in reproducible research workflows by addressing the reproducibility of plots in scientific publications.

28. 【2603.20422】PEARL: Personalized Streaming Video Understanding Model

Link: https://arxiv.org/abs/2603.20422

Authors: Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: memories over time, continuously recognize, identities and update, update our memories, Streaming Video Understanding

Comments: Arxiv Submission

Abstract:Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at this https URL.

29. 【2603.20366】WebNavigator: Global Web Navigation via Interaction Graph Retrieval

Link: https://arxiv.org/abs/2603.20366

Authors: Xuanwang Zhang, Yuteng Han, Jinnan Qi, Mulong Xie, Zhen Wu, Xinyu Dai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: current methods remain, current methods, complex web environments, Topological Blindness, significant advances

Comments: 24 pages, 3 figures

Abstract:Despite significant advances in autonomous web navigation, current methods remain far from human-level performance in complex web environments. We argue that this limitation stems from Topological Blindness, where agents are forced to explore via trial-and-error without access to the global topological structure of the environment. To overcome this limitation, we introduce WebNavigator, which reframes web navigation from probabilistic exploration into deterministic retrieval and pathfinding. WebNavigator constructs Interaction Graphs via zero-token cost heuristic exploration offline and implements a Retrieve-Reason-Teleport workflow for global navigation online. WebNavigator achieves state-of-the-art performance on WebArena and OnlineMind2Web. On WebArena multi-site tasks, WebNavigator achieves a 72.9% success rate, more than doubling the performance of enterprise-level agents. This work reveals that Topological Blindness, rather than model reasoning capabilities alone, is an underestimated bottleneck in autonomous web navigation.

30. 【2603.20338】Low-pass Personalized Subgraph Federated Recommendation

Link: https://arxiv.org/abs/2603.20338

Authors: Wooseok Sim, Hogun Park

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: sharing raw data, Federated Recommender Systems, Federated recommender system, training decentralized models, Federated Recommender

Comments: Accepted at ICLR 2026. 31 pages, 3 figures, 12 tables

Abstract:Federated Recommender Systems (FRS) preserve privacy by training decentralized models on client-specific user-item subgraphs without sharing raw data. However, FRS faces a unique challenge: subgraph structural imbalance, where drastic variations in subgraph scale (user/item counts) and connectivity (item degree) misalign client representations, making it challenging to train a robust model that respects each client's unique structural characteristics. To address this, we propose a Low-pass Personalized Subgraph Federated recommender system (LPSFed). LPSFed leverages graph Fourier transforms and low-pass spectral filtering to extract low-frequency structural signals that remain stable across subgraphs of varying size and degree, allowing robust personalized parameter updates guided by similarity to a neutral structural anchor. Additionally, we leverage a localized popularity bias-aware margin that captures item-degree imbalance within each subgraph and incorporates it into a personalized bias correction term to mitigate recommendation bias. Supported by theoretical analysis and validated on five real-world datasets, LPSFed achieves superior recommendation accuracy and enhances model robustness.

31. 【2603.20336】GEM: A Native Graph-based Index for Multi-Vector Retrieval

Link: https://arxiv.org/abs/2603.20336

Authors: Yao Tian, Zhoujin Tian, Xi Zhao, Ruiyuan Zhang, Xiaofang Zhou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: improving retrieval quality, retrieval quality, queries and data, data are represented, single-vector approaches

Comments: This paper has been accepted by SIGMOD 2026

Abstract:In multi-vector retrieval, both queries and data are represented as sets of high-dimensional vectors, enabling finer-grained semantic matching and improving retrieval quality over single-vector approaches. However, its practical adoption is held back by the lack of effective indexing algorithms. Existing work, attempting to reuse standard single-vector indexes, often fails to preserve multi-vector semantics or remains slow. In this work, we present GEM, a native indexing framework for multi-vector representations. The core idea is to construct a proximity graph directly over vector sets, preserving their fine-grained semantics while enabling efficient navigation. First, GEM designs a set-level clustering scheme. It associates each vector set with only its most informative clusters, effectively reducing redundancy without hurting semantic coverage. Then, it builds local proximity graphs within clusters and bridges them into a globally navigable structure. To handle the non-metric nature of multi-vector similarity, GEM decouples the graph construction metric from the final relevance score and injects semantic shortcuts to guide efficient navigation toward relevant regions. At query time, GEM launches beam search from multiple entry points and prunes paths early using cluster cues. To further enhance efficiency, a quantized distance estimation technique is used for both indexing and search. Across in-domain, out-of-domain, and multi-modal benchmarks, GEM achieves up to 16x speedup over state-of-the-art methods while matching or improving accuracy.

32. 【2603.20316】Bypassing Document Ingestion: An MCP Approach to Financial QA

Link: https://arxiv.org/abs/2603.20316

Authors: Sasan Mansouri, Edoardo Pilla, Mark Wahrenburg, Fabian Woebbeking

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: information retrieval problem, retrieval problem, Model Context Protocol, Abstract, financial

Comments: 19 pages, 10 figures

Abstract:Answering financial questions is often treated as an information retrieval problem. In practice, however, much of the relevant information is already available in curated vendor systems, especially for quantitative analysis. We study whether, and under which conditions, Model Context Protocol (MCP) offers a more reliable alternative to standard retrieval-augmented generation (RAG) by allowing large language models (LLMs) to interact directly with data rather than relying on document ingestion and chunk retrieval. We test this by building a custom MCP server that exposes LSEG APIs as tools and evaluating it on the FinDER benchmark. The approach performs particularly well on the Financials subset, achieving up to 80.4% accuracy on multi-step numerical questions when relevant context is retrieved. The paper thus provides both a baseline for MCP-based financial question answering (QA) and evidence on where this approach breaks down, such as for questions requiring qualitative or document-specific context. Overall, direct access to curated data is a lightweight and effective alternative to document-centric RAG for quantitative financial QA, but not a substitute for all financial QA tasks.

33. 【2603.20309】BubbleRAG: Evidence-Driven Retrieval-Augmented Generation for Black-Box Knowledge Graphs

Link: https://arxiv.org/abs/2603.20309

Authors: Duyi Pan, Tianao Lou, Xin Li, Haoze Song, Yiwen Wu, Mengyi Deng, Mingyu Yang, Wei Wang

Categories: Information Retrieval (cs.IR); Databases (cs.DB)

Keywords: Large Language Models, Large Language, Language Models, exhibit hallucinations, hallucinations in knowledge-intensive

Comments: Technical Report

Abstract:Large Language Models (LLMs) exhibit hallucinations in knowledge-intensive tasks. Graph-based retrieval augmented generation (RAG) has emerged as a promising solution, yet existing approaches suffer from fundamental recall and precision limitations when operating over black-box knowledge graphs -- graphs whose schema and structure are unknown in advance. We identify three core challenges that cause recall loss (semantic instantiation uncertainty and structural path uncertainty) and precision loss (evidential comparison uncertainty). To address these challenges, we formalize the retrieval task as the Optimal Informative Subgraph Retrieval (OISR) problem -- a variant of Group Steiner Tree -- and prove it to be NP-hard and APX-hard. We propose BubbleRAG, a training-free pipeline that systematically optimizes for both recall and precision through semantic anchor grouping, heuristic bubble expansion to discover candidate evidence graphs (CEGs), composite ranking, and reasoning-aware expansion. Experiments on multi-hop QA benchmarks demonstrate that BubbleRAG achieves state-of-the-art results, outperforming strong baselines in both F1 and accuracy while remaining plug-and-play.

34. 【2603.20287】Report-based Recommendations for Policy Making and Agency Operations: Dataset and LLM Evaluation

Link: https://arxiv.org/abs/2603.20287

Authors: Aleksandra Edwards, Thomas Edwards, Jose Camacho-Collados, Alun Preece

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, text generation tasks, text generation

Comments: The paper has been accepted to LREC 2026

Abstract:Large Language Models (LLMs) are extensively used in text generation tasks. These generative capabilities bring us to a point where LLMs could potentially provide useful insights in policy making or agency operations. In this paper, we introduce a new task consisting of generating recommendations which can be used to inform future actions and improvements of agencies work within private and public organisations. In particular, we present the first benchmark and coherent evaluation for developing recommendation systems to inform organisation policies. This task is clearly different from usual product or user recommendation systems, but rather aims at providing a basis to suggest policy improvements based on the conclusions drawn from reports. Our results demonstrate that state-of-the-art LLMs have the potential to emphasize and reflect on key issues and learning points within generated recommendations.

35. 【2603.20286】Rethinking Retrieval-Augmentation as Synthesis: A Query-Aware Context Merging Approach

Link: https://arxiv.org/abs/2603.20286

Authors: Jiarui Guo, Yuemeng Xu, Zongwei Lv, Yangyujia Wang, Xiaolin Wang, Kan Liu, Tao Lan, Lin Qu, Tong Yang

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, enables Large Language, Language Models, Large Language, dynamically incorporating external

Abstract:Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to extend their existing knowledge by dynamically incorporating external information. However, practical deployment is fundamentally constrained by the LLM's finite context window, forcing a trade-off between information sufficiency and token consumption. Standard pipelines address this via a retrieve-then-select strategy, typically retaining only the top-k chunks based on relevance. Nevertheless, this approach is suboptimal: it inherently truncates critical bridging evidence located in the long tail of the relevance distribution, while simultaneously wasting the token budget on semantically redundant high-ranking chunks. In this paper, we rethink retrieval-augmentation as a dynamic optimization problem aimed at maximizing information density. We propose MergeRAG, a novel framework that shifts the paradigm from static filtering to query-aware synthesis. MergeRAG employs a scoring agent to restructure retrieved contexts through a dual-pathway mechanism: 1) Symmetric Merging, which consolidates weak signals to recover lost bridging evidence; 2) Asymmetric Merging, which utilizes entropy-guided anchoring to eliminate redundancy without sacrificing semantic integrity. We further introduce a Hierarchical Parallel Merging strategy that mitigates information loss while maximizing computational parallelism. Extensive experiments on standard benchmarks demonstrate that MergeRAG significantly outperforms state-of-the-art RAG baselines, achieving up to 13.7 points improvement in F1 score and 11.5 points in Exact Match (EM), respectively.

36. 【2603.20283】FastPFRec: A Fast Personalized Federated Recommendation with Secure Sharing

Link: https://arxiv.org/abs/2603.20283

Authors: Zhenxing Yan, Jidong Yuan, Yongqi Sun, Haiyang Liu, Zhihui Gao

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Graph neural network, systems effectively capture, effectively capture user-item, capture user-item relationships, recommendation systems effectively

Abstract:Graph neural network (GNN)-based federated recommendation systems effectively capture user-item relationships while preserving data privacy. However, existing methods often face slow convergence on graph data and privacy leakage risks during collaboration. To address these challenges, we propose FastPFRec (Fast Personalized Federated Recommendation with Secure Sharing), a novel framework that enhances both training efficiency and data security. FastPFRec accelerates model convergence through an efficient local update strategy and introduces a privacy-aware parameter sharing mechanism to mitigate leakage risks. Experiments on four real-world datasets (Yelp, Kindle, Gowalla-100k, and Gowalla-1m) show that FastPFRec achieves 32.0% fewer training rounds, 34.1% shorter training time, and 8.1% higher accuracy compared with existing baselines. These results demonstrate that FastPFRec provides an efficient and privacy-preserving solution for scalable federated recommendation.

37. 【2603.20278】OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis

Link: https://arxiv.org/abs/2603.20278

Authors: Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Training deep research, evidence aggregation, multi-step reasoning, Training deep, agents requires long-horizon

Abstract:Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at this https URL.

Computer Vision

1. 【2603.22286】WorldCache: Content-Aware Caching for Accelerated Video World Models

Link: https://arxiv.org/abs/2603.22286

Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: power high-fidelity video, costly spatio-temporal attention, high-fidelity video world, video world models, remain computationally expensive

Comments: 33 Pages

Abstract:Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3x inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on World-Cache (this https URL).
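
To make the Zero-Order Hold baseline in the abstract above concrete, the sketch below reuses a cached block output as a static snapshot whenever the block input has drifted only slightly since the last real computation. This illustrates the baseline behaviour that the abstract criticizes, not WorldCache itself; the class name, the drift measure, and the threshold value are all assumptions.

```python
def relative_drift(current, reference):
    """Sum of absolute changes, relative to the reference magnitude."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1e-8
    return num / den


class ZeroOrderHoldCache:
    """Reuse a cached output while the input stays close to the reference."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.ref_input = None
        self.cached_output = None
        self.hits = 0

    def __call__(self, block, x):
        if (
            self.ref_input is not None
            and relative_drift(x, self.ref_input) < self.threshold
        ):
            self.hits += 1
            return self.cached_output  # static snapshot, no recomputation
        # Drift too large (or first call): recompute and refresh the cache.
        self.ref_input = list(x)
        self.cached_output = block(x)
        return self.cached_output


def double(x):
    """Stand-in for an expensive transformer block (hypothetical)."""
    return [2.0 * v for v in x]


cache = ZeroOrderHoldCache(threshold=0.05)
inputs = [[1.0, 1.0], [1.01, 0.99], [1.5, 1.5], [1.52, 1.5]]
outputs = [cache(double, x) for x in inputs]
```

In this toy run the second and fourth inputs are served from the cache, so their outputs are slightly stale. With a single global threshold and verbatim reuse, such staleness is what can surface as ghosting or blur when the scene is actually moving.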

2. 【2603.22285】VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Link: https://arxiv.org/abs/2603.22285

Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long video understanding, large language models, limited context windows, understanding remains challenging, multimodal large language

Abstract:Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at this https URL
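
The abstract above describes an affinity graph built from visual similarity and temporal proximity, over which relevance scores of observed segments are propagated to unseen ones. The following is a minimal sketch of that general idea under stated assumptions: the affinity formula, the damped neighbour-averaging update, and all parameter values are hypothetical and are not the paper's actual procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_affinity(features, tau=2.0):
    """Affinity = clipped cosine similarity x exponential temporal decay."""
    n = len(features)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(cosine(features[i], features[j]), 0.0)
                temporal = math.exp(-abs(i - j) / tau)
                w[i][j] = visual * temporal
    return w


def propagate(affinity, observed, steps=20, alpha=0.8):
    """Spread observed relevance scores to unseen segments.

    observed maps segment index -> score; those segments stay clamped.
    """
    n = len(affinity)
    scores = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            if i in observed:
                updated.append(observed[i])
                continue
            total = sum(affinity[i])
            avg = (
                sum(w * s for w, s in zip(affinity[i], scores)) / total
                if total
                else 0.0
            )
            updated.append(alpha * avg + (1 - alpha) * scores[i])
        scores = updated
    return scores


# Five segments with toy 2-D features; segments 0 and 2 have been scored.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
scores = propagate(build_affinity(features), {0: 1.0, 2: 0.0})
```

Here segment 1, which resembles and sits next to the relevant segment 0, ends up with a higher score than segment 3, which resembles the irrelevant segment 2. The unseen segments can then be ranked without being observed directly.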

3. 【2603.22283】End-to-End Training for Unified Tokenization and Latent Denoising

Link: https://arxiv.org/abs/2603.22283

Authors: Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: learned latent spaces, enable high-fidelity synthesis, Generative Encoder, latent space, high-fidelity synthesis

Comments: First two authors contributed equally. Project: [this https URL](https://xingjianbai.com/unite-tokenization-generation/) Code: [this https URL](https://github.com/ShivamDuggal4/UNITE-tokenization-generation)

Abstract:Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization generation from scratch is feasible.

4. 【2603.22282】UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Link: https://arxiv.org/abs/2603.22282

Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: natural language, single architecture, framework for simultaneous, unified framework, Aligned Motion VAE

Comments: 42 pages, 16 figures

Abstract:We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.

5. 【2603.22281】ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Link: https://arxiv.org/abs/2603.22281

Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: shown promising capability, forecasting future world, Recent progress, future world states, shown promising

Comments: 10 pages, 5 figures

Abstract:Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

6. 【2603.22280】DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.22280

Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: language instructions directly, VLA models, map visual observations, standard VLA models, robotic actions

Abstract:Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

7. 【2603.22279】3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Link: https://arxiv.org/abs/2603.22279

Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Vision Language Models, Large Language, Vision Language, performing fine-grained visual

Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
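
The abstract above reports gains in IoU and center-distance error for edited layouts. As a reference for how such metrics are commonly computed, here is a minimal sketch for axis-aligned 3D boxes. The box format `(xmin, ymin, zmin, xmax, ymax, zmax)`, the function names, and the axis-aligned assumption are illustrative choices, not details confirmed by the paper.

```python
import math


def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def center_distance(a, b):
    """Euclidean distance between the two box centers."""
    ca = [(a[i] + a[i + 3]) / 2.0 for i in range(3)]
    cb = [(b[i] + b[i + 3]) / 2.0 for i in range(3)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Toy example: a predicted box shifted by one unit along x.
target = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predicted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)

iou = box_iou_3d(target, predicted)
dist = center_distance(target, predicted)
```

IoU captures how well size and placement agree jointly, while center distance isolates placement error, so reporting both gives a fuller picture of spatial precision.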

8. 【2603.22278】The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Link: https://arxiv.org/abs/2603.22278

Authors: Kelly Cui,Nikhil Prakash,Ayush Raina,David Bau,Antonio Torralba,Tamar Rott Shaham

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: require vision-language models, visual question answering, multimodal tasks, question answering, require vision-language

Comments: 26 pages, 35 figures

Click to view abstract

Abstract:Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.

9. 【2603.22275】Repurposing Geometric Foundation Models for Multi-view Diffusion

Link: https://arxiv.org/abs/2603.22275

Authors: Wooseok Jang,Seonghu Jeon,Jisang Han,Jinhyeok Choi,Minkyung Kwon,Seungryong Kim,Saining Xie,Sainan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driven substantial progress, VAE latent space, latent space, remains largely unexplored, optimal latent space

Comments: project website: [this https URL](https://cvlab-kaist.github.io/GLD/)

Click to view abstract

Abstract:While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

10. 【2603.22271】DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

Link: https://arxiv.org/abs/2603.22271

Authors: Zhengyao Lv,Menghan Xia,Xintao Wang,Kwan-Yee K. Wong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based video super-resolution, prohibitive sampling costs, recently achieved remarkable, achieved remarkable fidelity, Diffusion-based video

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.

11. 【2603.22270】GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning

Link: https://arxiv.org/abs/2603.22270

Authors: Yixuan Luo,Feng Qiao,Zhexiao Xiong,Yanjing Li,Nathan Jacobs

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: expensive ground-truth annotations, ground-truth annotations limits, computer vision, fundamental problem, problem in computer

Comments

Click to view abstract

Abstract:Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.

12. 【2603.22249】EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

Link: https://arxiv.org/abs/2603.22249

Authors: Jeffri Murrugarra-Llerena,Pranav Chitale,Zicheng Liu,Kai Ao,Yujin Ham,Guha Balakrishnan,Paola Cascante-Bonilla

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reciprocal interpersonal interactions, social intelligence needed, Social group detection, group detection, family members

Comments: Project Page: [this https URL](https://lab-spell.github.io/EgoGroups/)

Click to view abstract

Abstract:Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. Our findings include that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural regions clearly influence model performance.

13. 【2603.22230】Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds

Link: https://arxiv.org/abs/2603.22230

Authors: Sopitta Thurachen,Josef Taher,Matti Lehtomäki,Leena Matikainen,Linnea Blåfield,Mikel Calle Navarro,Antero Kukko,Tomi Westerlund,Harri Kaartinen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: land cover mapping, Accurate land cover, land cover, cover mapping, point cloud data

Comments

Click to view abstract

Abstract:Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model's generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.

14. 【2603.22229】Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre

Link: https://arxiv.org/abs/2603.22229

Authors: Alex Salvatierra,José Antonio Sanz,Christian Gutiérrez,Mikel Galar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, significantly improved, focus on indoor, indoor or terrestrial, Superpoint Transformer

Comments: 6 pages, 2 figures

Click to view abstract

Abstract:Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
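
The gap between overall accuracy and mean IoU reported above is typical under class imbalance. The sketch below shows how both metrics are commonly derived from a confusion matrix; it is a generic illustration, not the evaluation code of the paper, and the function names and the toy matrix in the note are hypothetical.

```python
def per_class_iou(confusion):
    """Per-class IoU, where confusion[t][p] counts true class t predicted as p."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        # A class that never appears and is never predicted has no defined IoU.
        ious.append(true_pos / denom if denom else float("nan"))
    return ious


def mean_iou(confusion):
    """Mean of the defined per-class IoU values (NaN entries are skipped)."""
    ious = [v for v in per_class_iou(confusion) if v == v]
    return sum(ious) / len(ious)


def overall_accuracy(confusion):
    """Fraction of all points whose predicted class equals the true class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total
```

With a made-up two-class matrix `[[950, 0], [40, 10]]`, overall accuracy is 0.96 while mean IoU is only about 0.58, because the rare class reaches an IoU of just 0.2.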

15. 【2603.22228】SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Link: https://arxiv.org/abs/2603.22228

Authors: Sashuai Zhou,Qiang Zhou,Junpeng Ma,Yue Cao,Ruofan Hu,Ziang Zhang,Xiaoda Yang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Zhou Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, assess semantic alignment, reinforcement learning, reward models, semantic alignment

Comments

Click to view abstract

Abstract:Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

16. 【2603.22212】Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Link: https://arxiv.org/abs/2603.22212

Authors: Meiqi Wu,Zhixin Cai,Fufangchen Zhao,Xiaokun Feng,Rujing Dang,Bingze Song,Ruitian Tian,Jiashu Zhu,Jiachen Lei,Hao Dou,Jing Tang,Lei Sun,Jiahong Wu,Xiangxiang Chu,Zeming Liu,Kaiqi Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video-based world models, world, world models, models

Comments

Click to view abstract

Abstract:Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.

17. 【2603.22198】Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning

Link: https://arxiv.org/abs/2603.22198

Authors: Daniel Shao,Joel Runevic,Richard J. Chen,Drew F.K. Williamson,Ahrong Kim,Andrew H. Song,Faisal Mahmood

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Instance Learning, classifying gigapixel whole-slide, gigapixel whole-slide images, Multiple Instance, Instance Learning

Comments: Published in ICLR 2026 (37 pages, 16 figures)

Click to view abstract

Abstract:Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average $+3.8\%$ change in performance. Code is available at this https URL.
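
The three-step MIL sequence described above can be sketched as follows. This shows only the standard baseline with a single shared linear layer and mean or max pooling, not MAMMOTH itself; the function names, the plain-list data layout, and the assumption that patch features are already extracted (step 1) are illustrative choices.

```python
def linear(features, weight, bias):
    """Apply y = W x + b to one patch feature vector (lists of floats)."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


def mil_forward(patch_features, weight, bias, pooling="mean"):
    """Steps 2 and 3 of the MIL pipeline for one slide.

    patch_features: list of per-patch feature vectors (step 1 output, assumed given).
    Returns a single slide-level feature vector.
    """
    # Step 2: the shared linear layer maps general-purpose patch features
    # to task-specific patch features.
    projected = [linear(patch, weight, bias) for patch in patch_features]

    # Step 3: aggregate all patches into one slide feature.
    columns = list(zip(*projected))
    if pooling == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling: {pooling}")
```

The slide feature returned here would then be passed to a classifier; the abstract's claim is that replacing the linear step matters more than the choice of `pooling`.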

18. 【2603.22193】PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation

Link: https://arxiv.org/abs/2603.22193

Authors: Mingju Gao,Kaisen Yang,Huan-ang Gao,Bohan Li,Ao Ding,Wenyi Li,Yangcheng Yu,Jinkun Liu,Shaocong Xu,Yike Niu,Haohan Chi,Hao Chen,Hao Tang,Li Yi,Hao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Hand-object interaction, HOI generation, central to embodied, HOI, existing HOI generation

Comments: Accepted to CVPR 2026. Code: [this https URL](https://github.com/GasaiYU/PAM)

Click to view abstract

Abstract:Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.

19. 【2603.22190】A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis

Link: https://arxiv.org/abs/2603.22190

Authors: Shukesh Reddy,Abhijit Das

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: face analysis, self-supervised auxiliary task, auxiliary task

Comments: Accepted for publication in SN Computer Science

Click to view abstract

Abstract:In this work, we benchmark different backbones and study their impact on self-supervised learning (SSL) as an auxiliary task that blends texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address three research questions: "What is the role of the backbone in the performance of L-SSAT?", "What type of backbone is effective for different face analysis tasks?", and "Is there any generalized backbone for effective face analysis with L-SSAT?". Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.

20. 【2603.22187】Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Link: https://arxiv.org/abs/2603.22187

Authors: Junrong Guo,Shancheng Fang,Yadong Qu,Hongtao Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, natural language descriptions, Multimodal Large, Large Language

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at this https URL.

21. 【2603.22165】ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

Link: https://arxiv.org/abs/2603.22165

Authors: Kaili Huang,Hongming Zhang,Rui Shen,Linjun Dai,Jiahao Wang,Hanming Deng,Lewei Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aligning Large Vision-Language, Direct Preference Optimization, Large Vision-Language Models, Visual Anchor Collapse, rejected responses collapses

Comments

Click to view abstract

Abstract:While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
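
To make the idea of scaling only the rejected reward concrete, here is a minimal sketch built on the standard per-example DPO loss. The abstract does not give the form of ACPO's complexity-aware coefficient, so `rejected_scale` below is a hypothetical constant used purely for illustration, and `preference_loss` is not the paper's objective. Setting `rejected_scale=1.0` recovers plain DPO.

```python
from math import exp, log1p


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -log1p(exp(-z))
    return z - log1p(exp(z))


def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, rejected_scale=1.0):
    """Per-example DPO-style loss with an asymmetric scale on the rejected term.

    logp_*: sequence log-probabilities under the policy being trained.
    ref_*:  sequence log-probabilities under the frozen reference model.
    rejected_scale: hypothetical stand-in for a coefficient applied only to
                    the rejected reward (assumption, not the ACPO formula).
    """
    reward_chosen = beta * (logp_chosen - ref_chosen)
    reward_rejected = beta * (logp_rejected - ref_rejected)
    margin = reward_chosen - rejected_scale * reward_rejected
    return -log_sigmoid(margin)
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2 regardless of the scale; a scale below 1 shrinks the rejected term's contribution to the margin, and therefore its gradient, while leaving the chosen term untouched.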

22. 【2603.22154】dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

Link: https://arxiv.org/abs/2603.22154

Authors: Alois Bachmann

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: per-layer trainable activation, trainable activation defined, lightweight learned scalars

Comments: 22 pages, 15 figures

Click to view abstract

Abstract:This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path and $\mathrm{BaseAct}(x)$ resembles any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN with an average improvement of $+6.00\%$, with a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU ($7.40\%$ advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
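
The activation formula in the abstract can be written out directly. The sketch below evaluates $f_i(x)$ for a single scalar input, treating `alpha` and `beta` as plain floats; in the paper they are learned per layer, and their initial values are not stated in the abstract. The function names are hypothetical.

```python
from math import exp, log1p, tanh


def relu(x):
    """Standard ReLU."""
    return max(0.0, x)


def mish(x):
    """Mish: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = x + log1p(exp(-x)) if x > 0 else log1p(exp(x))
    return x * tanh(softplus)


def dyn_activation(x, alpha, beta, base_act=relu):
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x, as given in the abstract.

    alpha=1, beta=0 gives the base nonlinearity unchanged;
    alpha=beta=1 removes the nonlinear term and leaves the identity.
    """
    return base_act(x) * (alpha - beta) + beta * x
```

With a ReLU base and `alpha=1`, the expression reduces to `x` for positive inputs and `beta * x` for negative ones, so `beta` acts like the negative slope of a leaky ReLU.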

23. 【2603.22153】Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

Link: https://arxiv.org/abs/2603.22153

Authors: Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: unmanned aerial vehicle, shown strong potential, supporting unmanned aerial, Recent advances, aerial vehicle

Comments: Accepted as a conference paper by CVPR2026

Abstract:Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV's heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show that Bearing-UAV yields lower localization error than previous matching/retrieval paradigms across diverse terrains. Our code and dataset will be made publicly available.

24. 【2603.22148】OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation

Link: https://arxiv.org/abs/2603.22148

Authors: Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Xinyu Gu, Zhe Jiang, Fenghua Ling, Ben Fei, Wenlong Zhang, Junjue Wang, Weihao Xuan, Pengfeng Xiao, Naoto Yokoya, Lei Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation, perceiving dynamic land, dynamic land surface, perceiving dynamic, dynamic land

Comments: 15 pages, 4 figures

Abstract:Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to narrow scope, limiting their generalization to the diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents' adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.

25. 【2603.22125】DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

Link: https://arxiv.org/abs/2603.22125

Authors: Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reducing token count, Reducing token, count is crucial, crucial for efficient, times

Comments: CVPR 2026

Abstract:Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose \textbf{D}etail-\textbf{A}ligned VAE, which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first $C$ channels come directly from the pretrained VAE at a base resolution, while an additional $D$ channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables $1024 \times 1024$ image generation with Stable Diffusion 3.5 using only $32 \times 32$ tokens, $4\times$ fewer than the original model, within 5 H100-days. It further unlocks $2048 \times 2048$ generation with SD3.5, achieving a $6\times$ speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
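
The token figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it derives the baseline token count purely from the numbers quoted above (a $32 \times 32$ grid that is $4\times$ smaller than the original), assumes the baseline grid is square, and uses hypothetical helper names `token_budget` and `expand_latent`. The channel lists stand in for real latent tensors.

```python
import math


def token_budget(image_side: int, grid_side: int, reduction: int) -> dict:
    """Derive token counts from the figures quoted in the abstract."""
    tokens = grid_side * grid_side
    baseline_tokens = tokens * reduction
    # Assumption: the baseline token grid is square.
    baseline_grid_side = math.isqrt(baseline_tokens)
    return {
        "tokens": tokens,
        "pixels_per_token_side": image_side // grid_side,
        "baseline_tokens": baseline_tokens,
        "baseline_grid_side": baseline_grid_side,
        "baseline_pixels_per_token_side": image_side // baseline_grid_side,
    }


def expand_latent(base_channels: list, detail_channels: list) -> list:
    """Explicit layout: the first C channels come from the pretrained VAE,
    followed by D extra channels that carry higher-resolution detail."""
    return list(base_channels) + list(detail_channels)


budget = token_budget(image_side=1024, grid_side=32, reduction=4)
```

Under these assumptions each token covers a $32 \times 32$ pixel patch instead of $16 \times 16$, which is where the $4\times$ reduction in token count comes from.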

26. 【2603.22123】Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling

Link: https://arxiv.org/abs/2603.22123

Authors: Jan Boysen, Hristina Uzunova, Heinz Handels, Jan Ehrhardt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise spatial delivery, respiratory motion, success in radiotherapy, motion, precise spatial

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) [this https URL](https://melba-journal.org/2026:008)

Abstract:A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodical features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.

27. 【2603.22121】Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Link: https://arxiv.org/abs/2603.22121

Authors: Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains challenging due, Text-driven video moment, hidden temporal dynamics, Text-driven video, remains challenging

Comments: The paper is accepted by CVPR-2026

Abstract:Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively; we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.

28. 【2603.22120】StreamingClaw Technical Report

Link: https://arxiv.org/abs/2603.22120

Authors: Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: posing stringent challenges, video understanding, streaming video understanding, closed loop, posing stringent

Comments: Under Progress

Abstract:Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed-loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.

29. 【2603.22102】FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

Link: https://arxiv.org/abs/2603.22102

Authors: Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang, Hao Dong

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Keywords: increasing demand, demand for augmented, augmented reality, reality and robotics, robotics is driving

Comments: Accepted to CVPR 2026

Abstract:The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: this https URL

30. 【2603.22094】Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Link: https://arxiv.org/abs/2603.22094

Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate harmful content, open-world scenarios, posing serious risks, trustworthy usage, increasingly deployed

Comments: CVPR 2026

Abstract:As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
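
The null-space idea in this abstract can be illustrated with a small linear-algebra sketch. The code below is not the authors' NullSteer implementation: it assumes the benign subspace is spanned by a handful of example activation vectors, builds the projector $P = I - BB^\top$ via Gram-Schmidt, and uses a hypothetical rank-1 steering map $\Delta(h) = s\,(w^\top P h)\,r$ with an illustrative probe direction `w` and refusal direction `r`. All names are made up for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis B of the span of the benign activations."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def null_space_component(h, basis):
    """P h with P = I - B B^T: the part of h outside the benign subspace."""
    r = list(h)
    for b in basis:
        c = dot(h, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r


def steer(h, basis, probe, refusal_dir, strength=1.0):
    """h' = h + strength * <probe, P h> * refusal_dir (hypothetical rank-1 map).

    For any h inside the benign subspace, P h = 0, so h is left unchanged.
    """
    gate = dot(probe, null_space_component(h, basis))
    return [hi + strength * gate * ri for hi, ri in zip(h, refusal_dir)]


# Toy 3-D example: benign activations span the x-y plane.
benign_basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
benign_out = steer([2.0, -3.0, 0.0], benign_basis, [0.0, 0.0, 1.0], [0.0, 0.0, 1.0])
harmful_out = steer([1.0, 1.0, 2.0], benign_basis, [0.0, 0.0, 1.0], [0.0, 0.0, 1.0])
```

In this toy setup the benign activation passes through untouched, while the activation with a component outside the benign plane is pushed further along the refusal direction.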

31. 【2603.22091】P-Flow: Prompting Visual Effects Generation

Link: https://arxiv.org/abs/2603.22091

Authors: Rui Zhao, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, dynamic visual effects, significantly improved, improved their ability, ability to follow

Abstract:Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at this https URL.

32. 【2603.22070】Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Link: https://arxiv.org/abs/2603.22070

Authors: Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: show strong generalization, vision-language models show, generalization across diverse, models show strong, show strong

Comments: CVPR 2026

Abstract:Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
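
The two ingredients named in this abstract, per-class Gaussians updated online and fusion by Bayesian model averaging, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the BayesMM method itself: features are scalars rather than embeddings, the online update is Welford's algorithm, and the per-modality log-evidence values are passed in by the caller. The names `RunningGaussian`, `class_posterior`, and `bma_fuse` are hypothetical.

```python
import math


class RunningGaussian:
    """Scalar Gaussian whose mean and variance are updated online (Welford)."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, min_var=1e-6):
        self.n = 0
        self.mean = prior_mean
        self.m2 = 0.0
        self.prior_var = prior_var
        self.min_var = min_var

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Fall back to the prior variance until two samples have arrived.
        if self.n < 2:
            return self.prior_var
        return max(self.m2 / self.n, self.min_var)

    def log_pdf(self, x):
        return -0.5 * (math.log(2.0 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_posterior(class_gaussians, x):
    """Class probabilities under one modality, assuming a uniform class prior."""
    return softmax([g.log_pdf(x) for g in class_gaussians])


def bma_fuse(posteriors, log_evidences):
    """Bayesian model averaging: weight each modality's posterior by its evidence."""
    weights = softmax(log_evidences)
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors)) for c in range(n_classes)]
```

With equal evidence the fused prediction is the plain average of the two modalities; as one modality accumulates more evidence, its posterior dominates the mixture.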

33. 【2603.22057】SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Link: https://arxiv.org/abs/2603.22057

Authors: Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Model, image representation models, fail to capture, real world, downstream applications

Comments: 35 pages; 7 figures

Abstract:Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

34. 【2603.22054】FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

Link: https://arxiv.org/abs/2603.22054

Authors: Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: stylized glyphs based, synthesize stylized glyphs, Artistic font generation, aims to synthesize, synthesize stylized

Comments: To appear in CVPR 2026

Abstract:Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.

35. 【2603.22042】Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Link: https://arxiv.org/abs/2603.22042

Authors: Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Euclidean embeddings remain, embeddings remain limited, achieved remarkable performance, Euclidean embeddings, achieved remarkable

Abstract:While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: this https URL.

36. 【2603.22041】DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Link: https://arxiv.org/abs/2603.22041

Authors: Binhong Tan, Zhaoxin Wang, Handing Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant safety concerns, content raises significant, raises significant safety, strong generation ability, demonstrated strong generation

Abstract:Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while maintaining generation quality on benign prompts.

37. 【2603.22036】GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction

Link: https://arxiv.org/abs/2603.22036

Authors: Youwen Yuan, Xi Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing translucent objects, translucent objects, Reconstructing translucent, difficult problem, translucent

Abstract:Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and the neural implicit field, which require relatively large computational costs. Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when light passes through translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate our method can adapt to different translucent materials.
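
The Fresnel-weighted blend mentioned in this abstract can be illustrated per pixel. The sketch below is an assumption-laden stand-in, not the GTSR renderer: it uses Schlick's approximation for the Fresnel term, an arbitrary index of refraction of 1.5, and a simple convex blend of a surface colour and an interior (scattering) colour. The function names are hypothetical.

```python
def schlick_fresnel(cos_theta: float, ior: float = 1.5) -> float:
    """Schlick's approximation: F = F0 + (1 - F0) * (1 - cos(theta))^5.

    ior = 1.5 is an illustrative value, not one taken from the paper.
    """
    f0 = ((1.0 - ior) / (1.0 + ior)) ** 2
    c = min(max(cos_theta, 0.0), 1.0)  # clamp to the valid range
    return f0 + (1.0 - f0) * (1.0 - c) ** 5


def blend_colors(surface_rgb, interior_rgb, fresnel: float):
    """Weight the surface colour by F and the interior colour by (1 - F)."""
    return tuple(fresnel * s + (1.0 - fresnel) * i
                 for s, i in zip(surface_rgb, interior_rgb))
```

At normal incidence (`cos_theta = 1`) the weight drops to the base reflectance, so the interior colour dominates; at grazing angles the weight approaches 1 and the surface colour takes over.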

38. 【2603.22027】Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models

Link: https://arxiv.org/abs/2603.22027

Authors: Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, diffusion-based real-world image, remain significant challenges, potential remain significant, real-world image restoration

Comments: 27 pages, 10 figures

Abstract:Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.

39. 【2603.22012】6D Robotic OCT Scanning of Curved Tissue Surfaces

Link: https://arxiv.org/abs/2603.22012

Authors: Suresh Guttikonda, Maximilian Neidhardt, Vidas Raudonis, Alexander Schlaefer

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Optical coherence tomography, Optical coherence, volumetric imaging modality, non-invasive volumetric imaging, coherence tomography

Comments: Accepted at IEEE ISBI 2026

Abstract:Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning is limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.

40. 【2603.22002】SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation

Link: https://arxiv.org/abs/2603.22002

Authors: Duy D. Nguyen, Phat T. Tran-Truong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, enabling global contextual, capability traditionally limited

Comments: none

Abstract:The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
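
The Dice coefficient mentioned in this abstract is the standard overlap score for segmentation masks, 2|P ∩ T| / (|P| + |T|). A minimal sketch follows, assuming binary masks flattened to 0/1 lists; the function name and the toy values are illustrative and are not taken from the paper or its benchmarks.

```python
def dice_coefficient(pred, target):
    """Dice = 2*|P ∩ T| / (|P| + |T|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened 2x2x2 volume (made-up values, not from any benchmark).
prediction = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
score = dice_coefficient(prediction, reference)
```

Here the masks share two foreground voxels out of three each, so the score is 4/6. Multi-class benchmarks usually report this per organ and then average, which the sketch leaves out.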

41. 【2603.21999】STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection

Link: https://arxiv.org/abs/2603.21999

Authors: Jianlin Chen, Gongyang Li, Zhijiang Zhang, Liang Chang, Dan Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Salient Object Detection, RGB-D Salient Object, Object Detection, Salient Object, gained significant interest

Comments: 12 pages, 8 figures, accepted by IEEE TMM

Abstract:Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its cores are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at this https URL.

42. 【2603.21987】LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving

Link: https://arxiv.org/abs/2603.21987

Authors: Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud, Idriss Gouigah, Eren Erdal Aksoy

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: vehicles face major, face major perception, RGB camera sensors, Autonomous vehicles face, RGB camera

Comments: Accepted for publication at IEEE Intelligent Vehicles Symposium - IVS 2026

Abstract:Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird's Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code at this https URL.
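
The abstract does not say how the mid-level gate is computed. As a rough illustration of the general idea of gated fusion, the sketch below weights per-modality feature vectors with softmax-normalized gate scores. The softmax gate, the function name, and the hand-set logits are all assumptions; in a trained network the logits would be predicted from the features themselves.

```python
import math


def gated_fusion(features, gate_logits):
    """Fuse per-modality feature vectors with softmax-normalized gate weights."""
    peak = max(gate_logits.values())
    # Subtract the maximum logit before exponentiating for numerical stability.
    exps = {name: math.exp(value - peak) for name, value in gate_logits.items()}
    norm = sum(exps.values())
    weights = {name: e / norm for name, e in exps.items()}
    dim = len(next(iter(features.values())))
    fused = [sum(weights[name] * features[name][i] for name in features) for i in range(dim)]
    return fused, weights


# Hypothetical 2-D features per sensor; the RADAR logit is set high by hand
# to mimic a low-visibility scene where RADAR is the more reliable modality.
features = {"lidar": [1.0, 0.0], "radar": [0.0, 1.0], "camera": [1.0, 1.0]}
gate_logits = {"lidar": 0.0, "radar": 2.0, "camera": 0.0}
fused, weights = gated_fusion(features, gate_logits)
```

The weights sum to one, so the fused vector stays on the same scale as its inputs while leaning toward whichever sensor the gate currently trusts.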

43. 【2603.21986】Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Link: https://arxiv.org/abs/2603.21986

Authors: SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio-video generative foundation, generative foundation model, generative foundation, open-source audio-video generative, present daVinci-MagiHuman

Comments: none

Abstract:We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.

44. 【2603.21978】GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design

Link: https://arxiv.org/abs/2603.21978

Authors: Xiaolei Zhou, Chuangjie Fang, Jie Wu, Jingyi Yang, Boyi Lin, Jianwei Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Parametric Computer-Aided Design, existing methods struggle, Computer-Aided Design, fundamental to modern, existing methods

Comments: Accepted to CVPR 2026 (Findings). Includes supplementary material

Abstract:Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windows hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that widens the sequence-length range to 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available on GitHub.

45. 【2603.21966】BHDD: A Burmese Handwritten Digit Dataset

Link: https://arxiv.org/abs/2603.21966

Authors: Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing, Thuya Myo Nyunt

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Burmese Handwritten Digit, handwritten Burmese digits, Burmese Handwritten, handwritten Burmese, Handwritten Digit Dataset

Comments: 4 pages, 9 figures, 1 table. Dataset available at [this https URL](https://github.com/baseresearch/BHDD)

Abstract:We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at this https URL

46. 【2603.21957】Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

Link: https://arxiv.org/abs/2603.21957

Authors: Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, computational costs due, Video large language, face high computational, high computational costs

Comments: Accepted by CVPR 2026

Abstract:Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
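
To make "high contribution and low redundancy" concrete, here is a greedy selection sketch in the style of maximal marginal relevance. It is a stand-in and not the paper's mechanism: the scoring rule (attention minus the highest cosine similarity to any token already kept), the `redundancy_weight` parameter, and the toy data are all assumptions, and the clustering-based merge of unselected tokens is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; zero vectors are treated as dissimilar."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_tokens(embeddings, attention, budget, redundancy_weight=1.0):
    """Greedily keep up to `budget` tokens from one global pool."""
    selected = []
    remaining = list(range(len(embeddings)))

    def gain(i):
        # Penalize a candidate by its closest match among tokens already kept.
        redundancy = max((cosine(embeddings[i], embeddings[j]) for j in selected), default=0.0)
        return attention[i] - redundancy_weight * redundancy

    while remaining and len(selected) < budget:
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Four toy tokens: 0 and 1 are near-duplicates with high attention.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.5, 0.5]]
attention = [0.9, 0.85, 0.4, 0.1]
kept = select_tokens(embeddings, attention, budget=2)
```

With a budget of two, the near-duplicate token 1 is skipped in favor of token 2 despite its higher attention, which is the behavior a single global pool is meant to encourage.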

47. 【2603.21944】Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Link: https://arxiv.org/abs/2603.21944

Authors: Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fixed training taxonomy, training taxonomy, aims to localize, localize and recognize, fixed training

Comments: 24 pages, 7 figures, Project page: [this https URL](https://ubin108.github.io/Group3D/)

Abstract:Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at this https URL.
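
The merge-time rule described here (associate fragments only when they are both semantically compatible and geometrically consistent) can be sketched with a union-find pass. Everything concrete below is an assumption made for illustration: axis-aligned boxes with an IoU threshold stand in for geometric consistency, and a hand-written label-to-group table stands in for the MLLM-derived compatibility groups.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return inter / (box_volume(a) + box_volume(b) - inter)


def merge_fragments(fragments, groups, iou_threshold=0.3):
    """Return one instance id per fragment after semantically gated merging."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            same_group = groups[fragments[i]["label"]] == groups[fragments[j]["label"]]
            overlapping = box_iou_3d(fragments[i]["box"], fragments[j]["box"]) >= iou_threshold
            if same_group and overlapping:  # both conditions must hold
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(fragments))]


# Hypothetical compatibility groups: "sofa" and "couch" name the same thing.
groups = {"sofa": 0, "couch": 0, "table": 1}
fragments = [
    {"label": "sofa", "box": (0, 0, 0, 2, 1, 1)},
    {"label": "couch", "box": (0.5, 0, 0, 2.5, 1, 1)},
    {"label": "table", "box": (0.4, 0, 0, 2.4, 1, 1)},
]
instance_ids = merge_fragments(fragments, groups)
```

All three boxes overlap heavily, so a geometry-only rule would fuse them into one object. The semantic gate keeps the table separate while still absorbing the sofa/couch naming difference across views.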

48. 【2603.21943】GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

Link: https://arxiv.org/abs/2603.21943

Authors: Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safe autonomous navigation, Accurate and fast, GPS-denied areas, vital for safe, safe autonomous

Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)

Abstract:Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being too slow for real-time use. In this paper, we introduce GeoFlow, a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
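
The refinement loop behind IRS can be pictured with a toy sketch: start from random location hypotheses, repeatedly apply a predicted correction, and read off a consensus. The predictor below is a hand-written stand-in that moves each hypothesis halfway toward a known target. The real model is learned and probabilistic, and the function names, the Cartesian displacement, and the mean-based consensus are assumptions of this example.

```python
import random


def refine_hypotheses(predict_displacement, num_hypotheses=16, num_steps=5, extent=100.0, seed=0):
    """Let a population of 2-D location hypotheses flow toward a consensus."""
    rng = random.Random(seed)
    hypotheses = [(rng.uniform(0, extent), rng.uniform(0, extent)) for _ in range(num_hypotheses)]
    for _ in range(num_steps):
        moved = []
        for x, y in hypotheses:
            dx, dy = predict_displacement((x, y))
            moved.append((x + dx, y + dy))
        hypotheses = moved
    # Consensus taken as the population mean (one simple choice among many).
    cx = sum(x for x, _ in hypotheses) / num_hypotheses
    cy = sum(y for _, y in hypotheses) / num_hypotheses
    return cx, cy


TRUE_LOCATION = (30.0, 70.0)  # made-up ground truth for the toy predictor


def toy_predictor(hypothesis):
    """Stand-in for a learned model: step halfway toward the true location."""
    return (
        0.5 * (TRUE_LOCATION[0] - hypothesis[0]),
        0.5 * (TRUE_LOCATION[1] - hypothesis[1]),
    )


estimate = refine_hypotheses(toy_predictor)
```

Raising `num_steps` or `num_hypotheses` spends more compute for a tighter estimate without touching the predictor, which is the kind of inference-time trade-off the abstract refers to.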

49. 【2603.21939】FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Link: https://arxiv.org/abs/2603.21939

Authors: Zhilin Tu, Kemou Li, Fengpeng Li, Jianwei Fei, Jiamin Zhang, Haiwei Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Robust AI-Generated Image, AI-generated image detection, images increasingly important, forged images increasingly, AI-generated forged images

Comments: 6th place (6/507) technical report at the NTIRE 2026: Robust AI-Generated Image Detection in the Wild Challenge

Abstract:The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse "in-the-wild" conditions, offering an effective and practical solution for real-world deepfake image detection.
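
The inference rule stated in this abstract, averaging the probabilities from four independently trained experts, amounts to a few lines. The sketch below uses made-up expert outputs and an assumed 0.5 decision threshold, neither of which comes from the paper.

```python
def ensemble_probability(expert_probs):
    """Average the per-expert probabilities that an image is AI-generated."""
    if not expert_probs:
        raise ValueError("need at least one expert probability")
    return sum(expert_probs) / len(expert_probs)


# Hypothetical outputs of four experts for one image (illustrative values only).
expert_probs = [0.91, 0.78, 0.85, 0.66]
p_fake = ensemble_probability(expert_probs)
is_fake = p_fake >= 0.5  # the threshold is an assumption of this sketch
```

Averaging probabilities rather than hard votes lets one confident expert outweigh a borderline one.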

50. 【2603.21937】MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation

Link: https://arxiv.org/abs/2603.21937

Authors: Wenqing Tian, Hanyi Mao, Zhaocheng Liu, Lihua Zhang, Qiang Liu, Jian Wu, Liang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Subject-driven image generation, support fine-grained control, Subject-driven image, generation is increasingly, increasingly expected

Comments: none

Abstract:Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.

51. 【2603.21936】Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment

Link: https://arxiv.org/abs/2603.21936

Authors: Roy Amoyal, Oren Freifeld, Chaim Baskin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting Alignment, present Gaussian Splatting, Gaussian Splatting, Splatting Alignment, present Gaussian

Comments: Accepted to CVPR 2026

Abstract:We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: this https URL

52. 【2603.21935】Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases

Link: https://arxiv.org/abs/2603.21935

Authors: Clemens Watzenböck, Daniel Aletaha, Michaël Deman, Thomas Deimel, Jana Eder, Ivana Janickova, Robert Janiczek, Peter Mandl, Philipp Seeböck, Gabriela Supp, Paul Weiser, Georg Langs

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Quantitative disease severity, inter-reader variability, Quantitative disease, scoring in medical, subject to inter-reader

Comments: Accepted for MIDL 2026; Reviews available at [this https URL](https://openreview.net/forum?id=c1UkGC3MVq)

Abstract:Quantitative disease severity scoring in medical imaging is costly, time-consuming, and subject to inter-reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert-annotated severity scores. Existing self-supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label-based ranking losses with rankings derived solely from the visitation order of a patient's longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease-relevant representations without using any expert labels. This generalizes the idea of Rank-N-Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low-label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few-shot learning experiment, fine-tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at this https URL.
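
One way to picture "rankings derived solely from the visitation order" is to build Rank-N-Contrast-style comparison sets using distance in visit order in place of label distance. The sketch below does that for a single patient. It is only an illustration: using the ordinal visit position (rather than, say, elapsed time) and the exact set construction are assumptions, not details taken from the paper.

```python
def chronological_rank_sets(visit_order):
    """For one patient, list (anchor, positive, negatives) index triples.

    visit_order[i] is the position of scan i in the patient's visit sequence.
    For a pair (anchor, positive), the negatives are the other scans that lie
    at least as far from the anchor in visit order as the positive does.
    """
    n = len(visit_order)
    triples = []
    for a in range(n):
        for p in range(n):
            if p == a:
                continue
            d_ap = abs(visit_order[a] - visit_order[p])
            negatives = [
                k
                for k in range(n)
                if k not in (a, p) and abs(visit_order[a] - visit_order[k]) >= d_ap
            ]
            triples.append((a, p, negatives))
    return triples


# Three scans of one hypothetical patient, taken at visits 0, 1 and 2.
triples = chronological_rank_sets([0, 1, 2])
```

For the anchor at visit 0, the scan from visit 1 should be embedded closer than the scan from visit 2, so visit 2 appears as its negative. A contrastive loss over such sets needs no severity labels, only the scan order.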

53. 【2603.21933】Camera-Agnostic Pruning of 3D Gaussian Splats via Descriptor-Based Beta Evidence

Link: https://arxiv.org/abs/2603.21933

Authors: Peter Fasogbon, Ugurcan Budak, Patrice Rondao Alface, Hamed Rezazadegan Tavakoli

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: enable efficient storage, efficient storage, downstream processing, essential for reducing, reducing their complexity

Comments: 14 pages, 3 figures, 2 tables

Abstract:The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.

54. 【2603.21931】SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery

Link: https://arxiv.org/abs/2603.21931

Authors: Valentin Wagner, Sebastian Bullinger, Michael Arens, Rainer Stiefelhagen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: geometrically regularized NeRF, mitigates overfitting-induced geometric, overfitting-induced geometric artifacts, geometric artifacts observed, observed in current

Comments: Accepted at the ISPRS Congress 2026

Abstract:We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.

55. 【2603.21928】The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation

Link: https://arxiv.org/abs/2603.21928

Authors: Guannan Lai, Da-Wei Zhou, Zhenguo Li, Han-Jia Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Continual Test-Time Adaptation, unlabeled data streams, accessing source data, Continual Test-Time, unlabeled data

Comments: Accepted to CVPR 2026

Abstract:Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at this https URL.
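
A short worked equation suggests why an average gradient outer product can recover the row space of a classifier. It assumes the common AGOP definition over input Jacobians and a purely linear head $f(z) = Wz$ on features $z_1, \dots, z_n$; the paper's sample-wise variant may differ in its details.

```latex
\mathrm{AGOP}(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} J_f(z_i)^{\top} J_f(z_i),
\qquad
f(z) = Wz \;\Rightarrow\; J_f(z_i) = W \;\Rightarrow\; \mathrm{AGOP}(f) = W^{\top} W,
\qquad
\operatorname{range}\!\left(W^{\top} W\right) = \operatorname{range}\!\left(W^{\top}\right) = \operatorname{rowspace}(W).
```

Under these assumptions, the leading eigenvectors of the AGOP span the same subspace as the rows of $W$, so the subspace can be tracked from gradients alone.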

56. 【2603.21911】A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing

Link: https://arxiv.org/abs/2603.21911

Authors: Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Synthetic hyperspectral image, traditional radiative transfer, radiative transfer models, transfer models remain, models remain computationally

Comments: none

点击查看摘要

Abstract:Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.

57. 【2603.21904】SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation

Link: https://arxiv.org/abs/2603.21904

Authors: Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, Leyi Wei, Ran Su, Qiangguo Jin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unsupervised Domain Adaptation, diverse clinical environments, deploying medical segmentation, medical segmentation models, Hierarchical Unsupervised Domain

Abstract:Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI-CT) and 78.51% (CT-MRI) on cardiac data, and 87.48% (MRI-CT) and 86.89% (CT-MRI) on abdominal data. The code is available at this https URL.
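
For readers unfamiliar with the Dice score reported above, it is twice the overlap between the predicted and reference masks, divided by the sum of their sizes. The sketch below uses flattened binary masks. The function name and the toy masks are illustrative and not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

Benchmarks usually compute this per class and average across classes and cases, so a reported 90.08% corresponds to a mean over several structures.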

58. 【2603.21901】CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Link: https://arxiv.org/abs/2603.21901

Authors: Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving temporal coherence, distinguish text overlays, Video subtitle removal, temporal coherence, subtitle removal aims

Abstract:Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.

59. 【2603.21886】ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

Link: https://arxiv.org/abs/2603.21886

Authors: Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie, Zijun Long

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, resulting in increased, increased effectiveness, textual information, Recent

Abstract:Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
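
The contrast between static embedding addition and adaptive gating can be sketched in a few lines. This is only an illustration of the general idea under strong assumptions: a single scalar sigmoid gate computed from the concatenated embeddings, with hypothetical `weights` and `bias` parameters. It is not ADaFuSE's actual architecture, which also includes a mixture-of-experts branch.

```python
import math


def add_fusion(text_emb, image_emb):
    """Static fusion: every generated-image embedding is trusted equally."""
    return [t + v for t, v in zip(text_emb, image_emb)]


def gated_fusion(text_emb, image_emb, weights, bias):
    """Scalar-gated fusion: the gate decides how much to rely on the text view.

    `weights` has one entry per element of the concatenated [text; image] vector.
    In a real model these parameters would be learned; here they are fixed by hand.
    """
    z = sum(w * x for w, x in zip(weights, text_emb + image_emb)) + bias
    gate = 1.0 / (1.0 + math.exp(-z))
    fused = [gate * t + (1.0 - gate) * v for t, v in zip(text_emb, image_emb)]
    return fused, gate


text_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]

balanced, g_balanced = gated_fusion(text_emb, image_emb, [0.0] * 4, 0.0)
text_heavy, g_text = gated_fusion(text_emb, image_emb, [0.0] * 4, 10.0)
```

With a zero bias the two views are averaged, while a large positive bias pushes the fused vector toward the text embedding, which is the behaviour one would want when the generated image is noisy.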

60. 【2603.21884】Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation

Link: https://arxiv.org/abs/2603.21884

Authors: Donald Shenaj, Federico Errica, Antonio Carta

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Low Rank Adaptation, pre-trained diffusion models, generate personalized images, facto fine-tuning strategy, Low Rank

Comments: Project page: [this https URL](https://donaldssh.github.io/NotAllLayersAreCreatedEqual/)

Abstract:Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA$^2$, achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: this https URL.

61. 【2603.21882】Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline

Link: https://arxiv.org/abs/2603.21882

Authors: Elías Masquil, Thibaud Ehret, Pablo Musé, Gabriele Facciolo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Satellite Stereo Pipeline, stereoscopic matching algorithms, Digital Surface Model, learning-based stereo matchers, Stereo Pipeline

Comments: Accepted at IGARSS 2026

Abstract:Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.

62. 【2603.21876】Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems

Link: https://arxiv.org/abs/2603.21876

Authors: Chengyin Hu, Yikun Guo, Yuxian Dong, Qike Zhang, Kalibinuer Tiliwalidi, Yiwei Wei, Haitao Shi, Jiujiang Guo, Jiahuan Long, Xiang Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual perception tasks, infrared pedestrian detectors, physical adversarial attacks, perception tasks, increasingly apparent

Abstract:Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.

63. 【2603.21872】Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Link: https://arxiv.org/abs/2603.21872

Authors: Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Group Relative Policy, Relative Policy Optimization, Group Relative, Policy Optimization, Relative Policy

Comments: 17 pages, 12 figures

Abstract:Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at this https URL.

64. 【2603.21867】Adversarial Camouflage

Link: https://arxiv.org/abs/2603.21867

Authors: Paweł Borsukiewicz, Daniele Lunghi, Melissa Tessa, Jacques Klein, Tegawendé F. Bissyandé

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: numerous beneficial applications, enabled numerous beneficial, raised significant concerns, beneficial applications, rapid development

Comments: 18 pages, 4 figures, 5 tables

Abstract:While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce Adversarial Camouflage as a novel solution for protecting users' privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.

65. 【2603.21864】Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

Link: https://arxiv.org/abs/2603.21864

Authors: Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generation has recently, recently emerged, central task, field of generative, video diffusion models

Abstract:Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.

66. 【2603.21856】Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning

Link: https://arxiv.org/abs/2603.21856

Authors: Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Deep Learning, Generative Deep, remains poorly understood, Deep Learning, traditional theoretical frameworks

Abstract:Generative Deep Learning is a powerful tool for modeling the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features, including composites, power spectra, and multiscale structures such as convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO or an isolated modulation by seasons and/or the El Niño-Southern Oscillation. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.

67. 【2603.21829】Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation

Link: https://arxiv.org/abs/2603.21829

Authors: Xiaochan Yuan, Pai Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computed tomography angiography, paramount clinical importance, Accurate segmentation, tomography angiography, cardiovascular diseases

Abstract:Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes -- sagittal, coronal, and axial -- thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity. Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives.

68. 【2603.21824】SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

Link: https://arxiv.org/abs/2603.21824

Authors: Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ensuring product quality, modern manufacturing, essential for ensuring, ensuring product, product quality

Comments: This paper was submitted to CVPR 2026. A revised version will be updated soon

Abstract:Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on this https URL.

69. 【2603.21820】Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

Link: https://arxiv.org/abs/2603.21820

Authors: Yanglin Deng, Tianyang Xu, Chunyang Cheng, Hui Li, Xiao-jun Wu, Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: salient thermal signatures, preserving natural textures, combines complementary modalities, visible image fusion, Paired Training Paradigm

Comments: Accepted by CVPR 2026

Abstract:Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of the Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at this https URL_unpair.

70. 【2603.21819】Ctrl-A: Control-Driven Online Data Augmentation

Link: https://arxiv.org/abs/2603.21819

Authors: Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas, Hüsnü Aslan, David Balslev-Harder, Nadia A. S. Smith, Alessandra Manzin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Keywords: augmentation strength distributions, introduce ControlAugment, incorporates principles, principles from control, control theory

Comments: 17 pages (11 pages main manuscript), 8 figures (5 in main manuscript)

Abstract:We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.

71. 【2603.21809】Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction

Link: https://arxiv.org/abs/2603.21809

Authors: Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le, Hyunseung Choo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: HTN-related retinal cues, imaging enables low-cost, Retinal fundus imaging, HTN-related retinal, retinal cues

Comments: 10 pages, 2 figures, 2 tables. Under review at MICCAI 2026

Abstract:Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at this https URL.
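
The graph-mediated imputation step can be sketched as clamped averaging over a kNN graph built on shared biomarkers. Everything below is an assumption made for illustration: the abstract does not specify the propagation operator, so this uses plain neighbor averaging with MRI nodes held fixed, and the names `knn_graph`, `propagate`, `biomarkers`, and `teacher_reps` are hypothetical.

```python
def knn_graph(points, k):
    """For each point, return the indices of its k nearest neighbors (squared Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def propagate(points, known, k=2, steps=50):
    """Impute representations for nodes missing from `known`.

    `known` maps node index -> teacher representation (MRI patients).
    Known nodes are clamped; unknown nodes (fundus patients) take the
    mean of their neighbors' current representations at every step.
    """
    neighbors = knn_graph(points, k)
    dim = len(next(iter(known.values())))
    reps = [list(known.get(i, [0.0] * dim)) for i in range(len(points))]
    for _ in range(steps):
        updated = []
        for i in range(len(points)):
            if i in known:
                updated.append(list(known[i]))
            else:
                updated.append(
                    [sum(reps[j][c] for j in neighbors[i]) / len(neighbors[i]) for c in range(dim)]
                )
        reps = updated
    return reps


# One shared biomarker per patient; nodes 0-1 are MRI patients, 2-3 are fundus patients.
biomarkers = [[0.0], [1.0], [0.1], [0.9]]
teacher_reps = {0: [1.0, 0.0], 1: [0.0, 1.0]}
targets = propagate(biomarkers, teacher_reps)
```

In this toy setup the fundus patient at 0.1 ends up with a target close to the representation of the MRI patient at 0.0, and the one at 0.9 close to that of the MRI patient at 1.0, which is the kind of brain-informed target the student would then be distilled toward.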

72. 【2603.21808】Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment

Link: https://arxiv.org/abs/2603.21808

Authors: Lei Yang, Yi He, Fei Wu, Shilin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mandarin visual speech, Chinese mandarin visual, visual speech recognition, existing Chinese VSR, Chinese VSR systems

Abstract:Chinese Mandarin visual speech recognition (VSR) has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.

73. 【2603.21806】Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition

Link: https://arxiv.org/abs/2603.21806

Authors: Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes, Hayit Greenspan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Full data acquisition, increases patient discomfort, limits clinical throughput, Compressed Sensing MRI, Full data

Abstract:Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances. Our code is available at this https URL.
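
The token-entropy uncertainty measure is the Shannon entropy of the predicted distribution over the token dictionary at each patch. The sketch below computes it for three hypothetical patches over a four-token dictionary and ranks the patches from most to least uncertain. The projection into k-space used by LES and the gradient step used by GEO are not shown, and the probabilities are made up for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical per-patch token distributions from a latent transformer.
patch_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident patch
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain patch
    [0.70, 0.10, 0.10, 0.10],  # moderately uncertain patch
]

entropies = [token_entropy(p) for p in patch_probs]

# Patches ordered from highest to lowest uncertainty.
ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
```

An active acquisition policy would then prioritize measurements that are expected to reduce the entropy of the highest-ranked patches.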

74. 【2603.21803】Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing

Link: https://arxiv.org/abs/2603.21803

Authors: Yaelle Zribi (ENC), Florian Cafiero (ENC, LRE), Vincent Lépinay, Chahan Vidal-Gorène (CJM, LIPN)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: humor in general, Stand-up comedy, stand-up comedy specials, filmed stand-up comedy, professionally filmed stand-up

Abstract:Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
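As a rough illustration of how continuous kinematic signals might be derived from 17-joint keypoints, the sketch below computes two of them. The abstract does not give formulas, so both definitions here are assumptions: arm spread as wrist-to-wrist distance normalized by shoulder width, and kinetic energy as the sum of squared joint speeds between consecutive frames. The joint indices follow the standard COCO keypoint order; `make_frame` and the toy poses are hypothetical.

```python
import math

# COCO 17-keypoint indices used below.
L_SHOULDER, R_SHOULDER, L_WRIST, R_WRIST = 5, 6, 9, 10

def arm_spread(frame):
    """Wrist-to-wrist distance divided by shoulder width (assumed definition)."""
    shoulder_width = math.dist(frame[L_SHOULDER], frame[R_SHOULDER])
    if shoulder_width == 0:
        return 0.0
    return math.dist(frame[L_WRIST], frame[R_WRIST]) / shoulder_width

def kinetic_energy(prev, curr, dt=1.0):
    """Sum of squared joint speeds between two frames (assumed proxy; dt=1 s at 1 fps)."""
    return sum((math.dist(p, c) / dt) ** 2 for p, c in zip(prev, curr))

def make_frame(overrides):
    """Build a toy 17-joint frame with all joints at the origin except the overrides."""
    frame = [(0.0, 0.0)] * 17
    for joint, xy in overrides.items():
        frame[joint] = xy
    return frame

open_pose = make_frame({5: (-1.0, 0.0), 6: (1.0, 0.0), 9: (-2.0, 0.0), 10: (2.0, 0.0)})
closed_pose = make_frame({5: (-1.0, 0.0), 6: (1.0, 0.0), 9: (-1.0, 0.0), 10: (1.0, 0.0)})

spread_open = arm_spread(open_pose)
spread_closed = arm_spread(closed_pose)
energy = kinetic_energy(open_pose, closed_pose)
```

Normalizing by shoulder width is one way to make the signal less sensitive to how large the performer appears in the shot; other normalizations would be equally plausible.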

75. 【2603.21787】Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent

Link: https://arxiv.org/abs/2603.21787

Authors: Lokeshwaran Manohar, Moritz Roidl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, reduced motion blur, Event cameras, high temporal resolution, provide high temporal

Abstract:Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the nonrecurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
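The relative improvement quoted in this abstract can be checked with a few lines of arithmetic. Only the 9.6% figure is stated in the abstract; the other two percentages are derived here from the reported mAP50 values, and the variable names are illustrative.

```python
def relative_improvement(new, baseline):
    """Relative change of `new` with respect to `baseline`, as a fraction."""
    return (new - baseline) / baseline

BASELINE = 0.260  # non-recurrent YOLOv8s mAP50, as reported in the abstract

scratch_gain = relative_improvement(0.285, BASELINE)  # best scratch recurrent model
gen1_gain = relative_improvement(0.329, BASELINE)     # GEN1-initialized fine-tuning
pedro_gain = relative_improvement(0.251, BASELINE)    # PEDRo initialization

for name, gain in [("scratch", scratch_gain), ("GEN1", gen1_gain), ("PEDRo", pedro_gain)]:
    print(f"{name}: {gain * 100:+.1f}% relative to baseline")
```

The scratch recurrent model comes out at roughly +9.6%, matching the abstract, while the PEDRo-initialized model falls slightly below the baseline.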

76. 【2603.21786】The Universal Normal Embedding

Link: https://arxiv.org/abs/2603.21786

Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: separate tracks, mathematical principles, largely advanced, advanced on separate, goals and grounded

Comments: Accepted to CVPR 2026

Abstract:Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at this https URL.

77. 【2603.21785】Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

Link: https://arxiv.org/abs/2603.21785

Authors: Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Resource-constrained autonomous robots, Resource-constrained autonomous, autonomous robots rely, tradeoff between accuracy, autonomous robots

Abstract:Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.

78. 【2603.21784】Dynamic Exposure Burst Image Restoration

Link: https://arxiv.org/abs/2603.21784

Authors: Woohyeok Kim, Jaesung Rim, Daeyeon Kim, Sunghyun Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Burst image restoration, Burst image, manually designed exposure, designed exposure settings, image restoration

Abstract:Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.

79. 【2603.21783】SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Link: https://arxiv.org/abs/2603.21783

Authors: Bingxuan Zhao, Qing Zhou, Chuang Yang, Qi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made remarkable strides, Rotary Position Embedding, Diffusion Transformers, remarkable strides, remote sensing

Abstract:Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at this https URL.

80. 【2603.21754】Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

Link: https://arxiv.org/abs/2603.21754

Authors: Xu Liu, Yongheng Zhang, Qiguang Chen, Yao Li, Sheng Wang, Libo Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: attracting increasing attention, achieved remarkable success, Static Visual Thought, Broken Visual Thought, Visual Thought Positioning

Comments: Accepted by AAAI 2026

Abstract:Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.

81. 【2603.21746】Getting to the Point: Why Pointing Improves LVLMs

Link: https://arxiv.org/abs/2603.21746

Authors: Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, explicit sequential steps, explainability of Large, Large Vision-Language, sequential steps

Abstract:Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.

82. 【2603.21716】When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?

Link: https://arxiv.org/abs/2603.21716

Authors: Bahar Dibaei Nia, Farzan Farnia

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Efficient selection, increasingly important, important in modern, Efficient, suboptimal models

Abstract:Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.

83. 【2603.21708】Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning

Link: https://arxiv.org/abs/2603.21708

Authors: Xi Wang, Xu Yang, Donghao Sun, Cheng Deng

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long-tail class incremental, remains highly challenging, class incremental learning, Long-tail class, remains highly

Abstract:Long-tail class incremental learning (LT CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information from coarse-grained to fine-grained. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic-visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.

84. 【2603.21701】Rethinking Token Reduction for Large Vision-Language Models

Link: https://arxiv.org/abs/2603.21701

Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, Visual Question Answering, high inference costs

Abstract:Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at this https URL.

85. 【2603.21700】PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

Link: https://arxiv.org/abs/2603.21700

Authors: Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rare neuroendocrine tumors, develop metastatic disease, Pheochromocytomas and paragangliomas, survival rates reported, neuroendocrine tumors

Abstract:Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring imposes a high workload on clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.

86. 【2603.21695】RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing

Link: https://arxiv.org/abs/2603.21695

Authors: Yiming Shao, Qiyu Dai, Chong Gao, Guanbin Li, Yeqiang Wang, He Sun, Qiong Zeng, Baoquan Chen, Wenzheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: spatially varying optical, varying optical distortions, presents fundamental challenges, fundamental challenges due, surfaces presents fundamental

Abstract:Novel view synthesis (NVS) through non-planar refractive surfaces presents fundamental challenges due to severe, spatially varying optical distortions. While recent representations like NeRF and 3D Gaussian Splatting (3DGS) excel at NVS, their assumption of straight-line ray propagation fails under these conditions, leading to significant artifacts. To overcome this limitation, we introduce RefracGS, a framework that jointly reconstructs the refractive water surface and the scene beneath the interface. Our key insight is to explicitly decouple the refractive boundary from the target objects: the refractive surface is modeled via a neural height field, capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. We formulate a refraction-aware Gaussian ray tracing approach that accurately computes non-linear ray trajectories using Snell's law and efficiently renders the underlying Gaussian field while backpropagating the loss gradients to the parameterized refractive surface. Through end-to-end joint optimization of both representations, our method ensures high-fidelity NVS and view-consistent surface recovery. Experiments on both synthetic and real-world scenes with complex waves demonstrate that RefracGS outperforms prior refractive methods in visual quality, while achieving 15x faster training and real-time rendering at 200 FPS. The project page for RefracGS is available at this https URL.
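The abstract relies on Snell's law to bend rays at the water surface. The sketch below shows the standard vector form of refraction for a single ray; it is a generic illustration rather than the RefracGS implementation, and the function name, the flat surface, and the refractive index of 1.33 for water are assumptions.

```python
import math

def refract(direction, normal, n1, n2):
    """Refract a unit direction at a surface whose unit normal faces the incoming ray.

    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n for d, n in zip(direction, normal))

AIR, WATER = 1.0, 1.33  # approximate refractive indices (assumed)

# A ray hitting a flat, upward-facing water surface at 45 degrees from the normal.
theta = math.radians(45.0)
incoming = (math.sin(theta), 0.0, -math.cos(theta))
surface_normal = (0.0, 0.0, 1.0)
refracted = refract(incoming, surface_normal, AIR, WATER)

# A steep ray travelling from water toward air is totally internally reflected.
steep = (math.sin(math.radians(60.0)), 0.0, math.cos(math.radians(60.0)))
blocked = refract(steep, (0.0, 0.0, -1.0), WATER, AIR)
```

For a wavy surface, the normal would vary per intersection point (for example, derived from a height field), while the refraction step itself stays the same.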

87. 【2603.21669】PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Link: https://arxiv.org/abs/2603.21669

Authors: Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang, Yijie Xu, Cheng Chi, Yuting Zhao, Huaihai Lyu, Peterson Co, Mingyu Cao, Qiongyu Zhang, Zhe Li, Enshen Zhou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng
 
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: binary success rates, obscure critical qualities, Process Reward Models, Current robotic evaluation, collapse rich execution

Abstract:Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
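The claim that macro-consistency follows from a scalar potential can be illustrated with a toy calculation: if per-step progress is defined as the difference of a potential between consecutive observations, the steps telescope, so the total depends only on the endpoints and any split of the trajectory adds up to the same value. The potential values and function name below are hypothetical, not outputs of an actual PRM.

```python
def step_progress(potentials):
    """Per-step progress as differences of a scalar potential along a trajectory."""
    return [later - earlier for earlier, later in zip(potentials, potentials[1:])]

# Hypothetical progress potentials for five observations of one trajectory.
phi = [0.0, 0.2, 0.15, 0.6, 1.0]

steps = step_progress(phi)
total = sum(steps)

# Splitting the trajectory at any point gives segment totals that add up to the same value.
first_half = sum(step_progress(phi[:3]))
second_half = sum(step_progress(phi[2:]))
```

The negative second step shows that a dense judge of this kind can register a temporary regression that a binary success flag would hide.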

88. 【2603.21664】HumanOmni-Speaker: Identifying Who said What and When

Link: https://arxiv.org/abs/2603.21664

Authors: Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: when. Current models, Current models suffer, Omni-modal Large Language, Large Language Models, multi-person conversational dynamics

Abstract:While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
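The token budget implied by the figures in this abstract (25 fps, 6 tokens per frame) works out as simple arithmetic. The helper below is a hypothetical illustration; it counts only these motion-residual tokens and ignores any other tokens a real model would add.

```python
def visual_delta_tokens(duration_s, fps=25, tokens_per_frame=6):
    """Motion-residual tokens for a clip, using the rates quoted in the abstract."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame

per_second = visual_delta_tokens(1)   # 25 frames * 6 tokens
per_minute = visual_delta_tokens(60)  # 1500 frames * 6 tokens
```

At these rates the encoder would emit 150 tokens per second of video, or 9,000 tokens for a one-minute clip.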

89. 【2603.21661】Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis

Link: https://arxiv.org/abs/2603.21661

Authors: Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: low-level computer vision, autonomous driving systems, robust outdoor surveillance, computer vision, driving systems

Comments: We aim at addressing the cross-scenario (i.e., O.O.D) de-rain challenge, which has been neglected for a long period

Abstract:Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.

90. 【2603.21660】OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

Link: https://arxiv.org/abs/2603.21660

Authors: Meilin Liu, Jiaying Wang, Jing Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, heterogeneous imaging modalities, collaborative medical image, remain tightly coupled, existing frameworks remain

Comments: Accepted by CVPR 2026 (Main)

Abstract:Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.

91. 【2603.21647】FedCVU: Federated Learning for Cross-View Video Understanding

Link: https://arxiv.org/abs/2603.21647

Authors: Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: privacy-preserving multi-camera video, multi-camera video understanding, promising paradigm, paradigm for privacy-preserving, privacy-preserving multi-camera

Abstract:Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
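The idea of keeping normalization parameters local while averaging only a selected subset of layers can be sketched as follows. This is a minimal illustration under stated assumptions, not the FedCVU implementation: the helper names (`is_view_specific`, `aggregate`, `apply_update`), the parameter naming convention, and the plain unweighted average are all hypothetical, and the abstract does not say how SLA chooses which layers to share.

```python
def is_view_specific(name):
    """Treat normalization parameters as view-specific (assumed naming convention)."""
    return "norm" in name or "bn" in name

def aggregate(client_states, shared_keys):
    """Average only the parameters named in shared_keys across all clients."""
    n = len(client_states)
    merged = {}
    for key in shared_keys:
        # zip(*...) walks the same position of this parameter in every client.
        columns = zip(*(state[key] for state in client_states))
        merged[key] = [sum(col) / n for col in columns]
    return merged

def apply_update(local_state, merged):
    """Overwrite shared parameters; anything not in merged stays local."""
    updated = dict(local_state)
    updated.update(merged)
    return updated

# Two toy clients, each holding one shared layer and one normalization layer.
clients = [
    {"conv.weight": [1.0, 3.0], "bn.weight": [10.0]},
    {"conv.weight": [3.0, 5.0], "bn.weight": [20.0]},
]
shared = [k for k in clients[0] if not is_view_specific(k)]
merged = aggregate(clients, shared)
new_client0 = apply_update(clients[0], merged)
```

Only the entries in `shared` are communicated, so restricting that list is also what reduces traffic in this sketch.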

92. 【2603.21638】No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids

Link: https://arxiv.org/abs/2603.21638

Authors: Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cameras produce asynchronous, event-based detectors convert, Event cameras produce, sparse event stream, Event cameras

Comments: 29 Pages, 9 Figures, 5 Tables

Abstract:Event cameras produce asynchronous, high-dynamic-range streams well suited for detecting small, fast-moving drones, yet most event-based detectors convert the sparse event stream into dense tensors, discarding the representational efficiency of neuromorphic sensing. We propose SparseVoxelDet, to our knowledge the first fully sparse object detector for event cameras, in which backbone feature extraction, feature pyramid fusion, and the detection head all operate exclusively on occupied voxel positions through 3D sparse convolutions; no dense feature tensor is instantiated at any stage of the pipeline. On the FRED benchmark (629,832 annotated frames), SparseVoxelDet achieves 83.38% mAP@50 while processing only 14,900 active voxels per frame (0.23% of the T×H×W grid), compared to 409,600 pixels for the dense YOLOv11 baseline (87.68% mAP@50). Relaxing the IoU threshold from 0.50 to 0.40 recovers mAP to 89.26%, indicating that the remaining accuracy gap is dominated by box regression precision rather than detection capability. The sparse representation yields 858 times GPU memory compression and 3,670 times storage reduction relative to the equivalent dense 3D voxel tensor, with data-structure size that scales with scene dynamics rather than sensor resolution. Error forensics across 119,459 test frames confirms that 71 percent of failures are localization near-misses rather than missed targets. These results demonstrate that native sparse processing is a viable paradigm for event-camera object detection, exploiting the structural sparsity of neuromorphic sensor data without requiring neuromorphic computing hardware, and providing a framework whose representation cost is governed by scene activity rather than pixel count, a property that becomes increasingly valuable as event cameras scale to higher resolutions.
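To make the sparsity figures concrete, the sketch below stores only occupied voxels and computes the occupancy ratio. It is an illustration, not the SparseVoxelDet pipeline: the function names are hypothetical, and the grid shape is an assumption. 409,600 pixels matches a 640×640 frame, and 16 temporal bins is a guess chosen only because it reproduces the quoted 0.23%; the abstract does not state the grid dimensions.

```python
def voxelize(events, t_bins, duration):
    """Map (t, y, x) events to a dict of occupied voxels and their event counts.

    Empty voxels are never stored, so memory grows with scene activity
    rather than with the size of the full grid.
    """
    counts = {}
    for t, y, x in events:
        # Clamp so that an event at t == duration falls into the last bin.
        b = min(int(t / duration * t_bins), t_bins - 1)
        key = (b, y, x)
        counts[key] = counts.get(key, 0) + 1
    return counts

def occupancy(active_voxels, t_bins, height, width):
    """Fraction of the dense T x H x W grid that is actually occupied."""
    return active_voxels / (t_bins * height * width)

cells = voxelize([(0.0, 1, 2), (0.01, 1, 2), (0.99, 3, 4)], t_bins=4, duration=1.0)
ratio = occupancy(14_900, t_bins=16, height=640, width=640)  # assumed grid shape
```

Under the assumed 16×640×640 grid, 14,900 active voxels out of 6,553,600 cells is about 0.23%, consistent with the figure in the abstract.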

93. 【2603.21629】Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition

Link: https://arxiv.org/abs/2603.21629

Authors: Wen Guo (1), Pengfei Zhao (1), Zongmeng Wang (4), Yufan Hu (2), Junyu Gao (3) ((1) Shandong Technology and Business University, (2) University of Science and Technology Beijing, (3) Institute of Automation, Chinese Academy of Sciences, (4) Inner Mongolia University)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiple Object Tracking, Object Tracking, computer vision, real-world scenarios, fundamental task

Comments: Accepted by CVPR 2026

Abstract:Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts. The code will be released at this https URL.

94. 【2603.21626】PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Link: https://arxiv.org/abs/2603.21626

Authors: Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Brain tumor MRI, radiotherapy target delineation, tumor MRI segmentation, enabling accurate lesion, accurate lesion detection

Comments: This paper has been accepted to the main conference of CVPR 2026

Abstract:Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M Params, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at this https URL.

95. 【2603.21619】Efficient Zero-Shot AI-Generated Image Detection

Link: https://arxiv.org/abs/2603.21619

Authors: Ryosuke Sonoda, Ramya Srinivasan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: posing significant challenges, images increasingly realistic, models has made, increasingly realistic, posing significant

Abstract:The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free approaches. Experiments on challenging benchmarks demonstrate the efficacy of our method over the state of the art (SoTA). In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.

96. 【2603.21618】4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video

Link: https://arxiv.org/abs/2603.21618

Authors: Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: casual monocular video, dynamic object reconstruction, dynamic object, monocular video

Abstract:We introduce 4DGS360, a diffusion-free framework for 360° dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360° geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to the visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360° 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views, enabling 360° evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.

97. 【2603.21615】AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing

Link: https://arxiv.org/abs/2603.21615

Authors: Guandong Li, Zhaobin Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text-guided image manipulation, flow matching models, Inversion-based image editing, Inversion-based image, flow matching

Abstract:Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at this https URL
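The difference between a hard binary cutoff and the continuous decay families named in the abstract (sigmoid, cosine, linear) can be sketched as below. This is an illustrative sketch, not the AdaEdit code: the function names, the normalized step variable `t`, the `cutoff` position, and the sigmoid `steepness` are all assumptions, since the abstract names only the decay families.

```python
import math

def injection_weight(t, kind="cosine", cutoff=0.5, steepness=10.0):
    """Weight in [0, 1] given to source features at normalized step t in [0, 1].

    t = 0 is the first denoising step and t = 1 the last. Parameter names
    and defaults are illustrative assumptions.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if kind == "binary":   # hard on/off schedule (the fixed baseline)
        return 1.0 if t < cutoff else 0.0
    if kind == "linear":   # straight-line decay from 1 to 0
        return 1.0 - t
    if kind == "cosine":   # half-cosine decay from 1 to 0
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":  # smooth step centered at the cutoff
        return 1.0 / (1.0 + math.exp(steepness * (t - cutoff)))
    raise ValueError(f"unknown schedule: {kind}")

def blend(source_feat, target_feat, w):
    """Mix source and target features element-wise with weight w on the source."""
    return [w * s + (1.0 - w) * g for s, g in zip(source_feat, target_feat)]
```

With the binary schedule the weight jumps from 1 to 0 at the cutoff, whereas the other three change gradually, which is the smooth hand-over from source preservation to target generation that the abstract describes.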

98. 【2603.21611】SARe: Structure-Aware Large-Scale 3D Fragment Reassembly

Link: https://arxiv.org/abs/2603.21611

Authors: Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unordered fragment point, fragment point clouds, common object coordinate, object coordinate system, fragment reassembly aims

Comments: 18 pages, 4 figures

Abstract:3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.

99. 【2603.21597】A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Link: https://arxiv.org/abs/2603.21597

Authors: Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Modern clinical practice, practice increasingly depends, clinical practice increasingly, Modern clinical, reasoning over heterogeneous

Abstract:Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra's potential for interpretable, robust decision support in clinical care.

100. 【2603.21584】SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.21584

Authors: Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, large language models, large language, modalities, SSAM

Comments: 25 Pages, 9 Figures, 5 Tables

Abstract:Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately, identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.

101. 【2603.21583】HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

Link: https://arxiv.org/abs/2603.21583

Authors: Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual reality, challenging task, autonomous driving, robotic control, crucial yet challenging

Comments: This is an accepted manuscript of an article published in Computer Vision and Image Understanding

Abstract:Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
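The contrast between fixed-threshold filtering and an easy-to-hard curriculum can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the HACMatch method: the function name, the per-sample `hardness` score (for example, predictive entropy), and the linear pacing from `start_frac` to `end_frac` are all hypothetical choices that the abstract does not specify.

```python
def select_pseudo_labels(hardness, progress, start_frac=0.2, end_frac=1.0):
    """Return indices of the easiest pseudo-labeled samples.

    hardness: one score per unlabeled sample, lower means easier.
    progress: training progress in [0, 1]; the kept fraction grows with it,
              so harder samples are admitted later in training.
    """
    progress = min(max(progress, 0.0), 1.0)
    frac = start_frac + (end_frac - start_frac) * progress
    k = max(1, round(frac * len(hardness)))
    # Rank samples from easiest to hardest and keep the first k.
    order = sorted(range(len(hardness)), key=lambda i: hardness[i])
    return order[:k]

scores = [0.9, 0.1, 0.5, 0.3, 0.7]
early = select_pseudo_labels(scores, progress=0.0)
middle = select_pseudo_labels(scores, progress=0.5)
late = select_pseudo_labels(scores, progress=1.0)
```

A fixed threshold would keep the same cut throughout training, while this pacing admits progressively harder samples.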

102. 【2603.21573】Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs

Link: https://arxiv.org/abs/2603.21573

Authors: Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Sai Praneeth Karimireddy, Rahul Gupta, Shrikanth Narayanan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visible sensitive content, benchmarks largely treat, Existing visual privacy, privacy benchmarks largely, largely treat privacy

Abstract:Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional: attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs, we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.

103. 【2603.21566】CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation

Link: https://arxiv.org/abs/2603.21566

Authors: Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland, Michael M. Lin, John B. Miller, David S. Friedman, Nazlee Zebardast, Lucia Sobrin, Tobias Elze

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: extension of Meta, real-time semantic segmentation, cataract ophthalmic surgery, Meta Segment, high accuracy

Abstract:We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.

104. 【2603.21565】Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance

Link: https://arxiv.org/abs/2603.21565

Authors: Yansong Lin, Zihan Cheng, Jielei Wang, Guoming Lua, Zongyong Cui

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Synthetic aperture radar, aperture radar automatic, SAR ATR, Synthetic aperture, radar automatic target

Abstract:Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy. These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.

105. 【2603.21562】Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection

Link: https://arxiv.org/abs/2603.21562

Authors: Mingle Zhou, Jiahui Liu, Jin Wan, Gang Li, Min Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continuous Anomaly Detection, Unsupervised Continuous Anomaly, heavy computational burden, computational burden issues, traditional Unsupervised Anomaly

Abstract:Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.

106. 【2603.21559】Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

Link: https://arxiv.org/abs/2603.21559

Authors: Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weakly-supervised video scene, significantly reducing annotation, reducing annotation costs, sparse temporal labeling, video scene graph

Comments: 28 pages, 11 figures

Abstract:Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.

107. 【2603.21557】From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy

Link: https://arxiv.org/abs/2603.21557

Authors: Bi'an Du, Daizong Liu, Pufan Li, Wei Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation lies, structural, Single-image, real world

Comments: Accepted to ICME 2026

Abstract:Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.

108. 【2603.21547】PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models

Link: https://arxiv.org/abs/2603.21547

Authors: Yiwei Xie, Zheng Zhang, Ping Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diffusion models report, models report substantial, report substantial suppression, sensitive content, report substantial

Comments: This preprint was posted after submission to IEEE Transactions

Abstract:Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at this https URL.

109. 【2603.21528】PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2603.21528

Authors: Gensheng Pei, Xiruo Jiang, Xinhao Cai, Tao Chen, Yazhou Yao, Byeungwoo Jeon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: open-vocabulary semantic segmentation, promises rapid adaptation, Training-free open-vocabulary semantic, semantic segmentation, promises rapid

Comments: Accepted by CVPR 2026

Abstract:Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL (Procrustes alignment with text-aware Laplacian propagation), a compact two-step inference procedure that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. The method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. PEARL sets a new state of the art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
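The Procrustes alignment step described in this abstract can be illustrated with a small sketch. The abstract does not say which polar iteration is used or how it is wired into the attention block, so the Newton-Schulz iteration below, the helper names (`polar_orthogonal_factor`, `procrustes_rotation`), and the toy key/query matrices are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def polar_orthogonal_factor(M, iterations=20):
    """Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X (assumed variant).

    Converges to the orthogonal polar factor of a full-rank square M once
    M is scaled so that its singular values lie in (0, sqrt(3)).
    """
    fro = math.sqrt(sum(v * v for row in M for v in row))
    X = [[v / fro for v in row] for row in M]  # Frobenius scaling keeps singular values <= 1
    for _ in range(iterations):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


def procrustes_rotation(K, Q):
    """Orthogonal R minimising ||K R - Q||_F, i.e. the polar factor of K^T Q."""
    return polar_orthogonal_factor(matmul(transpose(K), Q))


theta = 0.3
rot = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "keys" (hypothetical data)
Q = matmul(K, rot)                         # toy "queries": the keys, rotated
R = procrustes_rotation(K, Q)
aligned_K = matmul(K, R)                   # keys rotated toward the queries
```

In this toy setup the queries are an exact rotation of the keys, so the recovered `R` matches that rotation and `aligned_K` coincides with `Q`. With real attention features the fit would only be approximate.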

110. 【2603.21526】VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Link: https://arxiv.org/abs/2603.21526

Authors: Xinghan Li, Junhao Xu, Jingjing Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, interpretable deepfake detection, generating textual explanations, Multimodal large, large language models

Comments: Project page: [https://vigil.best](https://vigil.best)

Abstract:Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.

111. 【2603.21511】Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

Link: https://arxiv.org/abs/2603.21511

Authors: Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reliable industrial inspection, target-category training data, industrial inspection, crucial for reliable, reliable industrial

Comments: Accepted by CVPR 2026

Abstract:Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at this https URL.

112. 【2603.21504】Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification

Link: https://arxiv.org/abs/2603.21504

Authors: Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Slide Images, WSI classification pipelines, WSI classification, supervised WSI classification, giga-pixel in scale

Comments: Accepted for publication at CVPR 2026 Workshop on Medical Reasoning with Vision Language Foundation Models (Med-Reasoner)

Abstract:Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we make two key contributions. First, we introduce a new parameter-efficient prompt tuning method that scales and shifts features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy that avoids hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at this https URL.
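The scale-and-shift idea mentioned in this abstract can be sketched in a few lines. The abstract does not give the exact formulation, so the per-channel form `gamma * x + beta`, the function names, and the toy feature values below are assumptions for illustration only; a plain list stands in for a frozen text-encoder feature vector.

```python
def scale_shift(features, gamma, beta):
    """Apply a learnable per-channel scale (gamma) and shift (beta).

    Hypothetical illustration: only gamma and beta would be trained,
    while the encoder that produced `features` stays frozen.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def trainable_parameter_count(feature_dim, num_layers):
    """Two vectors (gamma, beta) of size feature_dim per modulated layer."""
    return 2 * feature_dim * num_layers


frozen = [0.5, -1.0, 2.0]
# Identity initialisation (gamma=1, beta=0) leaves the frozen features unchanged.
unchanged = scale_shift(frozen, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# After (hypothetical) training, each channel is rescaled and offset independently.
adapted = scale_shift(frozen, [2.0, 1.0, 0.5], [0.0, 0.5, -1.0])
```

Under these assumptions the number of trainable values grows only linearly with the feature width, which is the kind of saving the abstract alludes to.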

113. 【2603.21493】StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding

Link: https://arxiv.org/abs/2603.21493

Authors: Guowei Tang, Tianwen Qian, Huanran Zheng, Yifei Wang, Xiaoling Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: fundamental system-level challenge, system-level challenge, signals is essential, essential for real-world, real-world interactive

Abstract:Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Codes will be released at this https URL.
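A fixed-capacity memory bank of the kind this abstract mentions could look like the sketch below. The abstract does not state the eviction policy or the interface, so first-in-first-out eviction, the class name `FixedCapacityMemoryBank`, and the string stand-ins for frame features are assumptions.

```python
from collections import deque


class FixedCapacityMemoryBank:
    """Keeps at most `capacity` frame features; the oldest entry is evicted first.

    FIFO eviction is an assumption made for this sketch.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._frames = deque(maxlen=capacity)

    def add(self, frame_feature):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._frames.append(frame_feature)

    def context(self):
        """Return the currently accessible history, oldest first."""
        return list(self._frames)


bank = FixedCapacityMemoryBank(capacity=3)
for frame_id in range(5):
    bank.add(f"frame-{frame_id}")
history = bank.context()
```

Capping the history this way gives every model the same amount of past visual context, so differences in accuracy and latency are not confounded by how much history each model happens to store.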

114. 【2603.21488】Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation

Link: https://arxiv.org/abs/2603.21488

Authors: Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

Abstract:The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at this https URL.

115. 【2603.21484】Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models

Link: https://arxiv.org/abs/2603.21484

Authors: Hyundong Jin, Dongyoon Han, Eunwoo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: selectively refuse specific, refuse specific image-instruction, specific image-instruction pairs, enabling large vision-language, large vision-language models

Comments: Accepted to CVPR 2026

Abstract:Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.

116. 【2603.21482】ALADIN: Attribute-Language Distillation Network for Person Re-Identification

Link: https://arxiv.org/abs/2603.21482

Authors: Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent vision-language models, Recent vision-language, current CLIP-guided ReID, CLIP-guided ReID pipelines, ReID pipelines rely

Comments: 14 pages, 3 figures, 7 charts

Abstract:Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.

117. 【2603.21463】EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching

Link: https://arxiv.org/abs/2603.21463

Authors: Rahul Deshmukh, Aditya Chauhan, Avinash Kak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: handle significantly larger, significantly larger variations, providing matched pairs, deep-learning based image, sub-pixel precision

Abstract:Deep-learning-based image matching networks can now handle significantly larger variations in viewpoint and illumination while providing matched pairs of pixels with sub-pixel precision. These networks have been trained with ground-based image datasets and, implicitly, their performance is optimized for the pinhole camera geometry. Consequently, performance is suboptimal when such networks are used to match satellite images, since those images are synthesized as a moving satellite camera records one line at a time of the points on the ground. In this paper, we present EpiMask, a semi-dense image matching network for satellite images that (1) incorporates patch-wise affine approximations to the camera modeling geometry; (2) uses an epipolar distance-based attention mask to restrict cross-attention to geometrically plausible regions; and (3) fine-tunes a foundational pretrained image encoder for robust feature extraction. Experiments on the SatDepth dataset demonstrate up to 30% improvement in matching accuracy compared to re-trained ground-based models.
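The epipolar distance-based attention mask in point (2) can be illustrated with a generic two-view sketch. The paper works with patch-wise affine approximations of satellite camera geometry, which the abstract does not detail, so the 3x3 matrix `F_toy`, the function names, and the pixel threshold below are stand-in assumptions rather than the authors' formulation.

```python
import math


def epipolar_line(F, point):
    """Return line coefficients (a, b, c) of l = F @ [u, v, 1] in the second image."""
    u, v = point
    x = (u, v, 1.0)
    return tuple(sum(F[r][c] * x[c] for c in range(3)) for r in range(3))


def epipolar_distance(line, point):
    """Perpendicular distance from `point` to the line a*x + b*y + c = 0."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * point[0] + b * point[1] + c) / norm


def epipolar_mask(F, query_point, key_points, threshold):
    """True where a key location lies within `threshold` pixels of the epipolar line."""
    line = epipolar_line(F, query_point)
    return [epipolar_distance(line, kp) <= threshold for kp in key_points]


# Toy matrix for a purely horizontal camera shift (hypothetical):
# the epipolar line of (u, v) is the row y = v in the second image.
F_toy = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
mask = epipolar_mask(F_toy, (10.0, 5.0), [(0.0, 5.0), (3.0, 6.5), (7.0, 9.0)], threshold=2.0)
```

In a cross-attention layer, positions where the mask is `False` would typically have their attention logits set to a large negative value before the softmax, so that each query attends only to geometrically plausible keys.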

118. 【2603.21436】PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences

Link: https://arxiv.org/abs/2603.21436

Authors: Lanbo Xu, Liang Guo, Caigui Jiang, Cheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains fundamentally limited, enables dense scene, dense scene recovery, previously accumulated scene, accumulated scene structure

Abstract:Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.

119. 【2603.21432】Image-Based Structural Analysis Using Computer Vision and LLMs: PhotoBeamSolver

Link: https://arxiv.org/abs/2603.21432

Authors: Altamirano-Muñiz Emilio Fernando

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: idealized beam models, solving idealized beam, documented program capable, beam models, academic exercises

Comments: 10 pages

Abstract:This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.

120. 【2603.21426】Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.21426

Authors: Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin, Changyou Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Knowledge distillation establishes, paradigm that leverages, Knowledge distillation, teacher, Beta-weighted Knowledge Distillation

Comments: Accepted to CVPR 2026

Abstract:Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher-student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at this https URL.

121. 【2603.21387】Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER

Link: https://arxiv.org/abs/2603.21387

Authors: Feng Xu, Xun Li, Lars Petersson, Yulei Sui, David Ahmedt Aristizabal, Dadong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial expression recognition, Facial expression, facial data, inherently expose identity, significant privacy concerns

Comments: ICME 2026, Accepted

Abstract:Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.

122. 【2603.21386】Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation

Link: https://arxiv.org/abs/2603.21386

Authors: Nikolay Kormushev, Josip Šarić, Matej Kristan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited regional understanding, segmentation remains hindered, closed vocabularies suppress, mask selection bias, panoptic segmentation remains

Abstract:Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: this https URL

123. 【2603.21378】An InSAR Phase Unwrapping Framework for Large-scale and Complex Events

Link: https://arxiv.org/abs/2603.21378

Authors: Yijia Song, Juliet Biggs, Alin Achim, Robert Popescu, Simon Orrego, Nantheera Anantrasirichai

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)

Keywords: complex deformation patterns, involving complex deformation, scenarios involving complex, remains a critical, involving complex

Abstract:Phase unwrapping remains a critical and challenging problem in InSAR processing, particularly in scenarios involving complex deformation patterns. In earthquake-related deformation, shallow sources can generate surface-breaking faults and abrupt displacement discontinuities, which severely disrupt phase continuity and often cause conventional unwrapping algorithms to fail. Another limitation of existing learning-based unwrapping methods is their reliance on fixed and relatively small input sizes, while real InSAR interferograms are typically large-scale and spatially heterogeneous. This mismatch restricts the applicability of many neural network approaches to real-world data. In this work, we present a phase unwrapping framework based on a diffusion model, developed to process large-scale interferograms and to address phase discontinuities caused by deformation. By leveraging a diffusion model architecture, the proposed method can recover physically consistent unwrapped phase fields even in the presence of fault-related phase jumps. Experimental results on both synthetic and real datasets demonstrate that the method effectively addresses discontinuities associated with near-surface deformation and scales well to large InSAR images, offering a practical alternative to manual unwrapping in challenging scenarios.

124. 【2603.21377】HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis

Link: https://arxiv.org/abs/2603.21377

Authors: Mohamed A Mabrok

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: fundamental building block, structured inductive bias, damped harmonic oscillator, signal processing, damped harmonic

Abstract:We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position $q$ (feature content), momentum $p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior-boundary-exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at this https URL.
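To make the position, momentum, and energy quantities concrete, here is a minimal scalar sketch of a damped harmonic oscillator. The abstract does not define $z$, the discretisation, or the parameter values, so reading $z = q + ip$ (which gives $H = \tfrac{1}{2}(q^2 + p^2)$), the semi-implicit Euler integrator, and the constants `omega`, `gamma`, and `dt` are all assumptions for illustration.

```python
def energy(q, p):
    """H = 0.5 * |z|^2, assuming z = q + i*p (an assumption of this sketch)."""
    return 0.5 * (q * q + p * p)


def simulate_damped_oscillator(q0, p0, omega=1.0, gamma=0.2, dt=0.01, steps=1000):
    """Integrate dq/dt = p, dp/dt = -omega^2 * q - gamma * p with semi-implicit Euler."""
    q, p = q0, p0
    energies = [energy(q, p)]
    for _ in range(steps):
        p += dt * (-(omega ** 2) * q - gamma * p)  # update momentum first
        q += dt * p                                 # then position, using the new momentum
        energies.append(energy(q, p))
    return q, p, energies


q_final, p_final, energies = simulate_damped_oscillator(q0=1.0, p0=0.0)
```

With positive damping the energy decays over the trajectory, while position and momentum trade off against each other along the way. In the framework described above, the analogous per-pixel quantities would serve as feature content, boundary cues, and a saliency map respectively.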

125. 【2603.21366】Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

Link: https://arxiv.org/abs/2603.21366

Authors: Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling causal synthesis, enabling causal, Relax Forcing, recently emerged, promising paradigm

Comments: Project page: [https://zengqunzhao.github.io/Relax-Forcing](https://zengqunzhao.github.io/Relax-Forcing)

Abstract:Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
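The three-role memory decomposition (Sink, Tail, dynamically selected History) can be sketched as an index-selection routine. How the paper scores and selects History frames is not stated in the abstract, so the `relevance` scores, the top-k rule, the default sizes, and the function name `select_memory` are hypothetical.

```python
def select_memory(num_frames, relevance, sink_size=1, tail_size=2, history_size=2):
    """Pick frame indices to attend to: Sink + selected History + Tail.

    `relevance[i]` is a hypothetical score for past frame i; the actual
    selection criterion is an assumption of this sketch.
    """
    if len(relevance) != num_frames:
        raise ValueError("need one relevance score per generated frame")
    sink = list(range(min(sink_size, num_frames)))        # earliest frames: global stability
    tail_start = max(num_frames - tail_size, len(sink))
    tail = list(range(tail_start, num_frames))            # latest frames: short-term continuity
    middle = range(len(sink), tail_start)
    # Keep the highest-scoring middle frames, then restore temporal order.
    top = sorted(middle, key=lambda i: relevance[i], reverse=True)[:history_size]
    history = sorted(top)
    return sink + history + tail


scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.5, 0.4]
kept = select_memory(num_frames=8, relevance=scores)
```

Here five of the eight past frames are kept, so attention runs over a smaller set than the dense history, which is where the reduced attention overhead mentioned in the abstract would come from.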

126. 【2603.21356】FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction

Link: https://arxiv.org/abs/2603.21356

Authors: Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte, Rochishnu Chowdhury, Rushil Paresh Desai, Wuyang Chen, Daniel Martin, Michael Mahoney

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: physical world follow, world follow physical, follow physical laws, Real objects, world follow

Comments: Accepted by CVPR 2026

Abstract:Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at this https URL.

127. 【2603.21349】Respiratory Status Detection with Video Transformers

Link: https://arxiv.org/abs/2603.21349

Authors: Thomas Savage, Evan Madill

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: saving clinical skill, life saving clinical, clinical skill, visual inspection, life saving

Abstract:Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found that a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.

128. 【2603.21348】Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution

Link: https://arxiv.org/abs/2603.21348

Authors: Yu-Shan Tai, An-Yeu (Andy) Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant strides, Step Sequence Redistribution, time step sequence, step sequence, time step

Abstract:Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.

129. 【2603.21332】EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

Link: https://arxiv.org/abs/2603.21332

Authors: Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural Radiance Fields, Radiance Fields, Neural Radiance, Gaussian Splatting, rapidly with Neural

Comments: Accepted by CVPR 2026. Page: [this https URL](https://emotag26.github.io/)

Abstract:Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.

130. 【2603.21327】KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction

Link: https://arxiv.org/abs/2603.21327

Authors: Wenhan Wu, Zhishuai Guo, Chen Chen, Srijan Das, Hongfei Xue, Pu Wang, Aidong Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Stochastic human motion, Stochastic human, observed sequences, futures from observed, Stochastic

Abstract:Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
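
To make the idea of an SNR-adaptive Kalman filter over a frequency-indexed signal more concrete, here is a minimal scalar sketch. It assumes a random-walk state model and a made-up rule that maps SNR in decibels to the measurement-noise variance; the abstract does not give the actual state model or the adaptation rule, and the names `kalman_smooth` and `measurement_noise_from_snr` are illustrative.

```python
from typing import List, Sequence


def kalman_smooth(z: Sequence[float], q: float, r: float) -> List[float]:
    """Scalar Kalman filter with a random-walk state model.

    z: noisy coefficients, indexed here by frequency rather than time.
    q: process-noise variance, r: measurement-noise variance.
    """
    x, p = z[0], r                # initialise from the first measurement
    out = [x]
    for zk in z[1:]:
        p = p + q                 # predict: uncertainty grows
        k = p / (p + r)           # Kalman gain
        x = x + k * (zk - x)      # update with the new measurement
        p = (1.0 - k) * p
        out.append(x)
    return out


def measurement_noise_from_snr(snr_db: float, base_r: float = 1.0) -> float:
    """Hypothetical rule: lower SNR gives a larger r, hence stronger smoothing."""
    return base_r / (10.0 ** (snr_db / 10.0))
```

A jittery sequence (low estimated SNR) receives a large `r` and is smoothed heavily, while a clean sequence (high SNR) receives a small `r` and passes through almost unchanged.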

131. 【2603.21309】Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos

Link: https://arxiv.org/abs/2603.21309

Authors: Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial expression recognition, Facial expression, TTA methods, TTA, TTA methods rely

Abstract:Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.

132. 【2603.21305】Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication

Link: https://arxiv.org/abs/2603.21305

Authors: Idris Zakariyya, Pai Chet Ng, Kaushik Bhargav Sivangi, S. Mohammad Sheikholeslami, Konstantinos N. Plataniotis, Fani Deligianni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables collaborative model, recognition enables collaborative, raw video data, action recognition enables

Abstract:Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy ($\epsilon=0.65$) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at this https URL
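
As a rough illustration of "perturb and transmit only the tuned layers", the sketch below clips a client update and adds Gaussian noise to the selected layers, dropping the rest. This is the generic clip-and-noise pattern used in differentially private training, not the FedDP-STECAR code; `private_update`, the layer names, and `noise_std` are assumptions, and calibrating the noise to a target $\epsilon$ is not shown.

```python
import math
import random
from typing import Dict, List, Set


def private_update(
    update: Dict[str, List[float]],
    tuned_layers: Set[str],
    clip_norm: float,
    noise_std: float,
    rng: random.Random,
) -> Dict[str, List[float]]:
    """Clip and perturb only the selected layers; drop everything else."""
    selected = {k: v for k, v in update.items() if k in tuned_layers}
    # Global L2 norm over the values that will actually be transmitted.
    norm = math.sqrt(sum(x * x for v in selected.values() for x in v))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return {
        k: [x * scale + rng.gauss(0.0, noise_std) for x in v]
        for k, v in selected.items()
    }
```

Because only the entries in `tuned_layers` are returned, the payload sent to the server shrinks in proportion to the share of parameters left frozen.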

133. 【2603.21304】F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.21304

Authors: Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enable single-pass reconstruction, methods enable single-pass, Splatting methods enable, Gaussian Splatting, Gaussian Splatting methods

Comments: Project page: [this https URL](https://mlvlab.github.io/F4Splat)

Abstract:Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.

134. 【2603.21299】Identity-Consistent Video Generation under Large Facial-Angle Variations

Link: https://arxiv.org/abs/2603.21299

Authors: Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Single-view, struggle to preserve, naturalness, motion naturalness, identity

Abstract:Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.

135. 【2603.21295】Text-Image Conditioned 3D Generation

Link: https://arxiv.org/abs/2603.21295

Authors: Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang, Taoran Yi, Zanwei Zhou, Zhikuan Bao, Lingxi Xie, Wei Shen, Qi Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: motivating growing interest, industrial design, assets are essential, motivating growing, growing interest

Comments: CVPR 2026. Project page: [this https URL](https://jumpat.github.io/tigon-page) Code: [this https URL](https://github.com/Jumpat/tigon)

Abstract:High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: this https URL

136. 【2603.21289】When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Link: https://arxiv.org/abs/2603.21289

Authors: Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent progress, improvements largely rely, multimodal large language, large language models, high-quality annotated data

Comments: 21 pages, 7 figures

Abstract:Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at this https URL.
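
The step of converting absolute scores into relative advantages within a group can be sketched as below. It shows the usual GRPO-style normalisation (subtract the group mean, divide by the group standard deviation) and should be read as an assumption about the general technique; how the paper computes the Judge-modulated scores that feed into it is not shown, and `group_relative_advantages` is an illustrative name.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(scores: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalise the scores of one group of sampled trajectories.

    Each trajectory is judged relative to its siblings, so only the ordering
    and spread inside the group matter, not the absolute scale of the scores.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation of the group
    return [(s - mu) / (sigma + eps) for s in scores]
```

If every trajectory in a group gets the same score, all advantages are zero and that group contributes no policy gradient, which is one reason a continuous reweighting signal can help.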

137. 【2603.21287】Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting

Link: https://arxiv.org/abs/2603.21287

Authors: Yuntian Bo, Yazhou Zhu, Piotr Koniusz, Haofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional few-shot medical, broader clinical applicability, hinder broader clinical, Conventional few-shot, approaches face performance

Comments: Accepted by CVPR26

Abstract:Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM's over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at this https URL.

138. 【2603.21284】Sonny: Breaking the Compute Wall in Medium-Range Weather Forecasting

Link: https://arxiv.org/abs/2603.21284

Authors: Minjong Cheon

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)

Keywords: high-impact atmospheric events, data-driven weather forecasting, Weather forecasting, fundamental problem, problem for protecting

Abstract:Weather forecasting is a fundamental problem for protecting lives and infrastructure from high-impact atmospheric events. Recently, data-driven weather forecasting methods based on deep learning have demonstrated strong performance, often reaching accuracy levels competitive with operational numerical systems. However, many existing models rely on large-scale training regimes and compute-intensive architectures, which raises the practical barrier for academic groups with limited compute resources. Here we introduce Sonny, an efficient hierarchical transformer that achieves competitive medium-range forecasting performance while remaining feasible within reasonable compute budgets. At the core of Sonny is a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a subsequent full-width fast path integrates thermodynamic interactions. To stabilize medium-range rollout without an additional fine-tuning stage, we apply exponential moving average (EMA) during training. On WeatherBench2, Sonny yields robust medium-range forecast skill, remains competitive with operational baselines, and demonstrates clear advantages over FastNet, particularly at extended tropical lead times. In practice, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.
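
For readers unfamiliar with the exponential moving average (EMA) mentioned in this abstract, a minimal sketch of the standard update follows. Parameters are plain floats in a dictionary to keep the example self-contained; the function name `ema_update` and the decay value are assumptions, and the abstract does not state the decay Sonny uses.

```python
from typing import Dict


def ema_update(ema: Dict[str, float], params: Dict[str, float], decay: float = 0.999) -> None:
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
```

The update would be called once after every optimiser step, and the averaged copy would be the one used for rollout and evaluation.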

139. 【2603.21245】CornOrb: A Multimodal Dataset of Orbscan Corneal Topography and Clinical Annotations for Keratoconus Detection

Link: https://arxiv.org/abs/2603.21245

Authors: Mohammed El Amine Lazouni, Leila Ryma Lazouni, Zineb Aziza Elaouaber, Mohammed Ammar, Sofiane Zehar, Mohammed Youcef Bouayad Agha, Ahmed Lazouni, Amel Feroui, Ali H. Al-Timemy, Siamak Yousefi, Mostafa El Habib Daho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Orbscan corneal topography, corneal topography images, clinical annotations collected, publicly accessible multimodal, Orbscan corneal

Comments: Preprint, 9 pages, 4 figures, dataset paper. Corresponding author: mostafa.elhabibdaho@univ-brest.fr

Abstract:In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.

140. 【2603.21234】Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset

Link: https://arxiv.org/abs/2603.21234

Authors: Faisal Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: magnetic resonance imaging, effective treatment planning, brain tumor classification, resonance imaging, plays a critical

Comments: 11 pages, 3 figures

Abstract:Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.
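
Since the abstract reports precision, recall, and F1-score for a four-class problem, a small sketch of how these are computed for one class (one-vs-rest) may help. The function name `precision_recall_f1` and the toy labels in the usage note are illustrative and unrelated to the paper's data.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, with true labels `["glioma", "glioma", "meningioma", "non-tumor"]` and predictions `["glioma", "meningioma", "glioma", "non-tumor"]`, the glioma class has one true positive, one false positive, and one false negative, so all three metrics equal 0.5.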

141. 【2603.21233】DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture

Link: https://arxiv.org/abs/2603.21233

Authors: Young-Seo Chang, Yatong An, Jae-Sang Hyun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: depth map compression, DepthTCM maps depth, high-bit depth map, depth map, map compression

Abstract:We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first losslessly converted to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern profilometry systems; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel image using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
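
The global quantization to 4 bits per channel mentioned in this abstract can be pictured with a uniform quantizer over values normalised to [0, 1]. This is a generic sketch under that assumption, not the DepthTCM pipeline; the MWD encoding and the learned codec are omitted, and `quantize` and `dequantize` are illustrative names.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], bits: int) -> List[int]:
    """Map values in [0, 1] to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    return [min(levels, max(0, round(v * levels))) for v in values]


def dequantize(codes: Sequence[int], bits: int) -> List[float]:
    """Map integer codes back to values in [0, 1]."""
    levels = (1 << bits) - 1
    return [c / levels for c in codes]
```

At 4 bits each channel takes one of 16 values, so the worst-case rounding error for an in-range value is half a step, 1/30 of the full range, compared with 1/510 at 8 bits.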

142. 【2603.21232】QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression

Link: https://arxiv.org/abs/2603.21232

Authors: Zhongyang Li, Yaqian Li, Faming Fang, Rinyoichi Takezoe, Zi-Hao Bo, Cheng Qian, Mo Guang, Guixu Zhang, Kaiwen Long

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, large language models, language models suffer, Multimodal large, Query Guided Router

Abstract:Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
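
A Mixture-of-Experts-style fusion of branch outputs, as described in this abstract, can be sketched as a softmax-weighted sum. The sketch assumes each branch output has already been brought to the same shape (a flat feature vector here) and that the router emits one logit per branch; both are simplifications, and `softmax` and `fuse` are illustrative names rather than the QMoP API.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branch_outputs: Sequence[Sequence[float]], router_logits: Sequence[float]) -> List[float]:
    """Weight each branch output by its router probability and sum them."""
    weights = softmax(router_logits)
    dim = len(branch_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, branch_outputs)) for d in range(dim)]
```

In the paper's setting the logits would depend on both the visual input and the textual query, so the same image could be compressed differently for different questions.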

143. 【2603.21229】Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

Link: https://arxiv.org/abs/2603.21229

Authors: Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao, Meng Zhang, Hao Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural world requires, world requires pushing, detailed visual classification, Visually cataloging, cataloging and quantifying

Comments: Accepted by CVPR 2026. Project page: [this https URL](https://github.com/tiny-smart/TPC-268)

Abstract:Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom to species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at this https URL.

144. 【2603.21222】A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification

Link: https://arxiv.org/abs/2603.21222

Authors: Ting Han, Xiangyi Xie, Yiping Chen, Yumeng Du, Jin Ma, Aiguang Li, Jiaan Liu, Yin Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing imagery, large-scale hierarchical road, automatic multi-grade road, framework for automatic, large-scale hierarchical

Abstract:In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km² in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.

145. 【2603.21217】Reframing Long-Tailed Learning via Loss Landscape Geometry

Link: https://arxiv.org/abs/2603.21217

Authors: Shenghan Chen, Yiming Liu, Yanzhen Wang, Yujia Wang, Xiankai Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Balancing performance trade-off, data distributions remains, Balancing performance, trade-off on long-tail, data distributions

Comments: Accepted to CVPR 2026. 11 pages, 6 figures, 5 tables

Abstract:Balancing performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, which facilitates broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: this https URL.

146. 【2603.21213】Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis

Link: https://arxiv.org/abs/2603.21213

Authors: Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: controlled data augmentation, image generation enables, generation enables controlled, enables controlled data, Counterfactual image generation

Abstract:Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.

147. 【2603.21208】JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

Link: https://arxiv.org/abs/2603.21208

Authors: Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: DALLE remain susceptible, DALLE remain, deployed safety filters, Stable Diffusion, remain susceptible

Comments: 18 pages, 8 figures

Abstract:Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

148. 【2603.21206】Boundary-Aware Instance Segmentation in Microscopy Imaging

Link: https://arxiv.org/abs/2603.21206

Authors: Thomas Mendelson, Joshua Francois, Galit Lahav, Tammy Riklin-Raviv

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: studying cellular dynamics, Accurate delineation, overlapping instances remains, cellular dynamics, persistent challenge

Comments: Accepted for publication in IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abstract:Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: this https URL
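
The SDF-to-probability step mentioned in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the sign convention (positive inside a cell), the function names, and the single `steepness` parameter standing in for the learned mapping are all hypothetical.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sdf_to_probability(sdf, steepness=1.0):
    """Map a 2-D grid of signed distances to foreground probabilities.

    Assumed convention (not stated in the abstract): positive distances lie
    inside a cell, negative outside, zero on the contour. `steepness` is a
    hypothetical stand-in for whatever the learned mapping would tune.
    """
    return [[stable_sigmoid(steepness * d) for d in row] for row in sdf]


# Toy 1x5 strip crossing a cell boundary from outside to inside.
probs = sdf_to_probability([[-2.0, -1.0, 0.0, 1.0, 2.0]], steepness=4.0)
```

A larger `steepness` makes the transition around the zero level set sharper, which is one plausible reading of how a sigmoid over an SDF could yield crisp boundaries between adjacent instances.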

149. 【2603.21192】DSCSNet: A Dynamic Sparse Compression Sensing Network for Closely-Spaced Infrared Small Target Unmixing

Link: https://arxiv.org/abs/2603.21192

Authors: Zhiyang Tang, Yiming Zhu, Ruimin Huang, Meng Yang, Yong Ma, Jun Huang, Fan Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: optical lens focal, lens focal length, Close Small Object, distant clustered infrared, Small Object Unmixing

Comments: 13 pages, 8 figures

Abstract:Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict $\ell_1$-norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional $\ell_2$-norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
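
To see why an $\ell_1$ term preserves isolated peaks where an $\ell_2$ term smooths them, it helps to compare the two closed-form shrinkage rules. The sketch below is a generic illustration, not DSCSNet itself: soft-thresholding is the standard proximal operator of the $\ell_1$ norm, the fixed `tau` is an assumption (the abstract describes a dynamically predicted threshold), and the function names are hypothetical.

```python
def soft_threshold(values, tau):
    """Proximal operator of tau * ||x||_1, applied elementwise."""
    out = []
    for v in values:
        magnitude = max(abs(v) - tau, 0.0)
        if magnitude == 0.0:
            out.append(0.0)  # small entries are set exactly to zero
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out


def ridge_shrink(values, lam):
    """Closed-form minimiser of 0.5*(x - v)^2 + 0.5*lam*x^2, for contrast."""
    return [v / (1.0 + lam) for v in values]


# Two strong peaks (3.0 and -2.5) buried in weak background values.
signal = [0.05, -0.1, 3.0, 0.02, -2.5]
sparse = soft_threshold(signal, tau=0.2)
smooth = ridge_shrink(signal, lam=0.2)
```

With these toy numbers the $\ell_1$ rule zeroes the background and keeps the two peaks, while the $\ell_2$ rule scales every entry and leaves none at zero.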

150. 【2603.21176】GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

Link: https://arxiv.org/abs/2603.21176

Authors: Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Large Language, Large Language Models, Large Language, demonstrated remarkable capabilities, Diffusion Large

Comments: 25 pages, 7 figures

Abstract:While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.

151. 【2603.21166】Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images

Link: https://arxiv.org/abs/2603.21166

Authors: Jiatong Xia, Lingqiao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unposed RGB images, RGB images, unposed RGB, set of unposed, training-free system

Comments: Accepted by SIGGRAPH Asia 2025

Abstract:We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: this https URL

152. 【2603.21165】Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

Link: https://arxiv.org/abs/2603.21165

Authors: Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: everyday visual life, expressed through region, richly expressed, historically linked languages, Bengali culture

Comments: [this https URL](https://labib1610.github.io/BanglaVerse/)

Abstract:Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

153. 【2603.21160】Beyond a Single Signal: SPECTREG2, A Unified MultiExpert Anomaly Detector for Unknown Unknowns

Link: https://arxiv.org/abs/2603.21160

Authors: Rahul D Ray

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Epistemic intelligence requires, intelligence requires machine, requires machine learning, machine learning systems, Epistemic intelligence

Abstract:Epistemic intelligence requires machine learning systems to recognise the limits of their own knowledge and act safely under uncertainty, especially when faced with unknown unknowns. Existing uncertainty quantification methods rely on a single signal such as confidence or density and fail to detect diverse structural anomalies. We introduce SPECTRE-G2, a multi-signal anomaly detector that combines eight complementary signals from a dual-backbone neural network. The architecture includes a spectral normalised Gaussianization encoder, a plain MLP preserving feature geometry, and an ensemble of five models. These produce density, geometry, uncertainty, discriminative, and causal signals. Each signal is normalised using validation statistics and calibrated with synthetic out-of-distribution data. An adaptive top-k fusion selects the most informative signals and averages their scores. Experiments on synthetic, Adult, CIFAR-10, and Gridworld datasets show strong performance across diverse anomaly types, outperforming multiple baselines on AUROC, AUPR, and FPR95. The model is stable across seeds and particularly effective for detecting new variables and confounders. SPECTRE-G2 provides a practical approach for detecting unknown unknowns in open-world settings.
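
The normalise-then-fuse idea in the abstract can be sketched as follows. This is a rough illustration, not SPECTRE-G2's code: the abstract does not say how the "most informative" signals are chosen, so picking the k largest z-scores is an assumption here, and the signal names and statistics are made up.

```python
def normalise(score, val_mean, val_std):
    """Z-score a raw signal using statistics from held-out validation data."""
    return (score - val_mean) / val_std if val_std > 0 else 0.0


def top_k_fusion(raw_scores, val_stats, k=3):
    """Average the k largest normalised signals for one sample.

    raw_scores: dict of signal name -> raw anomaly score (higher = more anomalous)
    val_stats:  dict of signal name -> (mean, std) measured on validation data
    Assumes raw_scores is non-empty and k >= 1.
    """
    z = {name: normalise(s, *val_stats[name]) for name, s in raw_scores.items()}
    strongest = sorted(z.values(), reverse=True)[:k]
    return sum(strongest) / len(strongest)


# Hypothetical signals; validation statistics are set to (0, 1) for readability.
stats = {name: (0.0, 1.0) for name in ("density", "geometry", "uncertainty", "causal")}
raw = {"density": 4.0, "geometry": 0.1, "uncertainty": 3.0, "causal": -0.5}
fused = top_k_fusion(raw, stats, k=2)
```

Averaging only the strongest signals means one anomaly type that trips two detectors is not diluted by the detectors that stay silent, which is one plausible motivation for a top-k rule over a plain mean.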

154. 【2603.21138】Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

Link: https://arxiv.org/abs/2603.21138

Authors: Wenjin Hou, Xiaoxiao Sun, Hehe Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, demonstrated the potential, ZSL, generative ZSL, Recent

Abstract:Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.

155. 【2603.21136】MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics

Link: https://arxiv.org/abs/2603.21136

Authors: Pengxiang Cai, Mengyang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interactions remains challenging, advanced significantly, remains challenging, customizing scenes, interactions remains

Abstract:Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.

156. 【2603.21135】One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation

Link: https://arxiv.org/abs/2603.21135

Authors: Yu-Wen Tseng, Xingyi Zheng, Ya-Chen Wu, I-Bin Liao, Yung-Hui Li, Hong-Han Shuai, Wen-Huang Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: adapts pre-trained models, unlabeled test data, adapts pre-trained, pre-trained models, models to distribution

Comments: 14 pages, 6 figures

Abstract:Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.

157. 【2603.21134】Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition

Link: https://arxiv.org/abs/2603.21134

Authors: Zhiyan Cao, Zhengxi Wu, Yiwei Wang, Pei-Hsuan Lin, Li Zhang, Zhen Xie, Huan Zhao, Han Ding

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cardiovascular disease assessment, remains highly operator-dependent, views remains highly, autonomous probe adjustment, acquiring standard views

Comments: Accepted for publication at the IEEE ICRA 2026. 8 pages, 5 figures, 3 tables

Abstract:Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
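
One way to picture "priors fitted to Gaussian distributions" feeding a reward is the sketch below. It is an assumption-laden illustration rather than the paper's formulation: the feature names, the sample values, and the use of a summed Gaussian log-likelihood as the matching score are all hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_gaussian_prior(samples):
    """Fit a 1-D Gaussian (mean, std) to one feature measured on standard views."""
    return mean(samples), pstdev(samples)


def log_likelihood(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def ap_matching_reward(features, priors):
    """Sum of per-feature log-likelihoods: higher means closer to the prior."""
    return sum(log_likelihood(features[name], *priors[name]) for name in priors)


# Hypothetical anatomical features collected from expert-acquired standard views.
standard_views = {
    "lv_area_ratio": [0.30, 0.32, 0.28, 0.31, 0.29],
    "septum_angle_deg": [10.0, 12.0, 9.0, 11.0, 8.0],
}
priors = {name: fit_gaussian_prior(vals) for name, vals in standard_views.items()}

reward_good = ap_matching_reward({"lv_area_ratio": 0.30, "septum_angle_deg": 10.0}, priors)
reward_bad = ap_matching_reward({"lv_area_ratio": 0.45, "septum_angle_deg": 25.0}, priors)
```

A probe pose whose features sit near the fitted means scores higher than one far from them, which gives an RL agent a dense signal to climb instead of a sparse success flag.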

158. 【2603.21129】ReDiffuse: Rotation Equivariant Diffusion Model for Multi-focus Image Fusion

Link: https://arxiv.org/abs/2603.21129

Authors: Bo Li, Tingting Bao, Lingling Zhang, Weiping Fu, Yaxian Wang, Jun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive performance, multi-focus image fusion, achieved impressive, Diffusion models, MFIF

Comments: 10 pages, 9 figures

Abstract:Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64% across six evaluation metrics. The code is available at this https URL.

159. 【2603.21115】LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation

Link: https://arxiv.org/abs/2603.21115

Authors: Xiaoshan Wu, Xiaoyang Lyu, Yifei Yu, Bo Wang, Zhongrui Wang, Xiaojuan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: creates critical perceptual, critical perceptual gaps, Interframe Semantic Segmentation, Dense semantic segmentation, Anytime Interframe Semantic

Comments: Accepted at ICLR 2026

Abstract:Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: this https URL Code: this https URL.

160. 【2603.21114】CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs

Link: https://arxiv.org/abs/2603.21114

Authors: Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, achieve strong performance, large language models, Multimodal large, maintain stable spatial

Comments: 28 pages, 10 figures, 3 tables. Project page: [this https URL](https://shanmukha-here.github.io/CVT-Bench)

Abstract:Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene graphs, and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.

161. 【2603.21111】Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning

Link: https://arxiv.org/abs/2603.21111

Authors: Shih-Wen Liu, Yen-Chang Chen, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: methods remain largely, remain largely limited, parameter-efficient multi-task learning, multiple tasks efficiently, solve multiple tasks

Comments: Accepted to CVPR 2026

Abstract:Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: this https URL.
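
The claim that an elementwise sine can raise the rank of a low-rank kernel is easy to check on a toy case. The sketch below is not the Free Sinewich layer: it omits the convolutional prior and the Clock Net, the frequencies are hard-coded, and every name is hypothetical. It only shows a shared rank-1 kernel turning into distinct full-rank weights under two different frequencies.

```python
import math


def outer(u, v):
    """Rank-1 matrix u v^T, standing in for a low-rank adapter kernel."""
    return [[a * b for b in v] for a in u]


def sine_modulate(kernel, omega):
    """Elementwise sin(omega * w): one frequency per task, shared kernel."""
    return [[math.sin(omega * w) for w in row] for row in kernel]


def det2(m):
    """Determinant of a 2x2 matrix; zero exactly when its rank is below 2."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


base = outer([1.0, 2.0], [1.0, 2.0])      # [[1, 2], [2, 4]], rank 1
task_a = sine_modulate(base, omega=1.0)   # hypothetical frequency for task A
task_b = sine_modulate(base, omega=2.5)   # hypothetical frequency for task B
```

Here `det2(base)` is zero while both modulated kernels have non-zero determinants, so switching a single scalar frequency yields different full-rank weights from the same stored parameters.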

162. 【2603.21104】CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation

Link: https://arxiv.org/abs/2603.21104

Authors: Bowen Jing, Ruiyang Hao, Weitao Zhou, Haibao Yu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generating safety-critical driving, driving scenarios requires, scenarios requires understanding, dangerous interactions arise, Generating safety-critical

Comments: 28 pages, 7 figures

Abstract:Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism-adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.

163. 【2603.21100】Learning Progressive Adaptation for Multi-Modal Tracking

Link: https://arxiv.org/abs/2603.21100

Authors: He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: parameter-efficient fine-tuning modules, adopting pre-trained RGB, pre-trained RGB models, RGB pre-trained models, RGB pre-trained

Abstract:Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods. Code is available at this https URL.

164. 【2603.21095】Representation-Level Adversarial Regularization for Clinically Aligned Multitask Thyroid Ultrasound Assessment

Link: https://arxiv.org/abs/2603.21095

Authors: Dina Salama, Mohamed Mahmoud, Nourhan Bayasi, David Liu, Ilker Hacihaliloglu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: assessing thyroid nodules, Thyroid ultrasound, assessing thyroid, biopsy is warranted, thyroid nodules

Abstract:Thyroid ultrasound is the first-line exam for assessing thyroid nodules and determining whether biopsy is warranted. In routine reporting, radiologists produce two coupled outputs: a nodule contour for measurement and a TI-RADS risk category based on sonographic criteria. Yet both contouring style and risk grading vary across readers, creating inconsistent supervision that can degrade standard learning pipelines. In this paper, we address this workflow with a clinically guided multitask framework that jointly predicts the nodule mask and TI-RADS category within a single model. To ground risk prediction in clinically meaningful evidence, we guide the classification embedding using a compact TI-RADS aligned radiomics target during training, while preserving complementary deep features for discriminative performance. However, under annotator variability, naive multitask optimization often fails not because the tasks are unrelated, but because their gradients compete within the shared representation. To make this competition explicit and controllable, we introduce RLAR, a representation-level adversarial gradient regularizer. Rather than performing parameter-level gradient surgery, RLAR uses each task's normalized adversarial direction in latent space as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. On a public TI-RADS dataset, our clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines. Code and pretrained models will be released.
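
A penalty on angular alignment between task-specific adversarial directions could look roughly like the following. This is a hedged sketch, not the RLAR formulation: it assumes each task's adversarial direction is the normalized gradient of that task's loss with respect to the shared latent vector, and the squared hinge with a `margin` is a hypothetical choice, since the abstract does not give the exact form.

```python
import math

def unit(vec):
    """Normalize a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def alignment_penalty(grad_seg, grad_cls, margin=0.5):
    """Penalize cosine similarity above `margin` between the two normalized
    latent-space directions (assumed form: squared hinge)."""
    d_seg, d_cls = unit(grad_seg), unit(grad_cls)
    cosine = sum(a * b for a, b in zip(d_seg, d_cls))
    return max(0.0, cosine - margin) ** 2

# Orthogonal directions incur no penalty; identical directions incur the maximum.
print(alignment_penalty([1.0, 0.0], [0.0, 1.0]))
print(alignment_penalty([3.0, 4.0], [3.0, 4.0]))
```

In a training loop, a term like this would be added to the segmentation and classification losses with a weighting coefficient, which is again an assumption here.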

165. 【2603.21086】DGRNet: Disagreement-Guided Refinement for Uncertainty-Aware Brain Tumor Segmentation

Link: https://arxiv.org/abs/2603.21086

Authors: Bahram Mohammadi, Yanqiu Wu, Vu Minh Hieu Phan, Sam White, Minh-Son To, Jian Yang, Michael Sheng, Yang Song, Yuankai Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate brain tumor, Accurate brain, MRI scans, brain tumor segmentation, brain tumor

Comments: 10 pages, 3 figures, 4 tables

Abstract:Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. Afterward, we build disagreement maps to identify regions of high segmentation uncertainty, which are then selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse. The experimental results on the TextBraTS dataset show that DGRNet favorably improves state-of-the-art segmentation accuracy by 2.4% and 11% in main metrics Dice and HD95, respectively, while providing meaningful uncertainty estimates.
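
The idea of building a disagreement map from several view-specific predictions and refining only the uncertain regions can be sketched as follows. This is an illustration under assumptions, not DGRNet's implementation: disagreement is taken to be the per-pixel variance across views, and the threshold value and function names are hypothetical.

```python
from statistics import pvariance

def disagreement_map(view_probs):
    """Per-pixel population variance across views.
    `view_probs` holds one flat list of foreground probabilities per view."""
    return [pvariance(pixel) for pixel in zip(*view_probs)]

def refinement_mask(disagreement, threshold=0.02):
    """Mark pixels whose cross-view disagreement exceeds the threshold."""
    return [d > threshold for d in disagreement]

# Four view-specific predictions for three pixels (toy values).
views = [
    [0.9, 0.1, 0.6],
    [0.9, 0.1, 0.2],
    [0.9, 0.1, 0.8],
    [0.9, 0.1, 0.4],
]

disagreement = disagreement_map(views)
mask = refinement_mask(disagreement)
print(disagreement, mask)
```

Only the third pixel, where the views disagree, would be passed to the text-conditioned refinement step in this toy setup.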

166. 【2603.21085】Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models

Link: https://arxiv.org/abs/2603.21085

Authors: Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, Shuhang Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficient image generation, learn diffusion processes, Latent diffusion models, latent space, Latent

Comments: Accepted to CVPR 2026

Abstract:Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.

167. 【2603.21083】Hierarchical Text-Guided Brain Tumor Segmentation via Sub-Region-Aware Prompts

Link: https://arxiv.org/abs/2603.21083

Authors: Bahram Mohammadi, Ta Duc Huy, Afrouz Sheikholeslami, Qi Chen, Vu Minh Hieu Phan, Sam White, Minh-Son To, Xuyun Zhang, Amin Beheshti, Luping Zhou, Yuankai Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Brain tumor segmentation, ambiguous visual boundaries, tumor segmentation remains, segmentation remains challenging, exhibit ambiguous visual

Comments: 10 pages, 3 figures, 4 tables

Abstract:Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT-TC-ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions against state-of-the-art methods by 1.7% and 6% on the main metrics Dice and HD95.

168. 【2603.21077】CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.21077

Authors: Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: question remains unresolved, fundamental question remains, large language models, Multimodal large language, achieve remarkable progress

Comments: Accepted by CVPR 2026

Abstract:Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.

169. 【2603.21071】CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels

Link: https://arxiv.org/abs/2603.21071

Authors: Ping Guo, Chengzhou Li, Guanchen Meng, Qi Jia, Jinyuan Liu, Zhu Liu, Yu Liu, Zhongxuan Luo, Xin Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: underwater sensing technologies, important underwater sensing, exhibits unique imaging, forward-looking sonar exhibits, sensing technologies

Comments: Accepted to CVPR 2026 Findings

Abstract:As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.

170. 【2603.21069】NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2603.21069

Authors: Yupeng Zhang, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant gap remains, open-vocabulary object detection, remarkable progress, progress in open-vocabulary, gap remains

Comments: CVPR 2026 Accept

Abstract:Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection. To address these issues, we propose a novel training framework, NoOVD, which integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with the background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.

171. 【2603.21064】Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.21064

Authors: Hwasik Jeong, Seungryong Lee, Gyeongjin Kang, Seungkwon Yang, Xiangyu Sun, Seungtae Nam, Eunbyung Park

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: uncalibrated multi-view images, single forward pass, enabling high-quality Gaussian, Gaussian Splatting, high-quality Gaussian representations

Comments: Project page: https://hwasikjeong.github.io/2Xplat

Abstract:Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.

172. 【2603.21061】Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving

Link: https://arxiv.org/abs/2603.21061

Authors: Haixi Zhang, Aiyinsi Zuo, Zirui Li, Chunshu Wu, Tong Geng, Zhiyao Duan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving technology, camera-based autonomous driving, Amidst the rapid, rapid advancement, advancement of camera-based

Comments: 9 pages, 5 figures

Abstract:Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.

173. 【2603.21055】SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM

Link: https://arxiv.org/abs/2603.21055

Authors: Pengchong Hu, Zhizhong Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: RGBD SLAM, made remarkable progress, progress in RGBD, Gaussian Splatting, made remarkable

Comments: CVPR 2026

Abstract:3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at this https URL.
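
A pixel-aligned Gaussian that may only slide along its camera ray can be parameterized by a single depth offset. The sketch below illustrates that constraint under a standard pinhole camera model; it is not the SGAD-SLAM code, and the intrinsics, the `delta` offset, and the function names are hypothetical.

```python
def pixel_ray(u, v, fx, fy, cx, cy):
    """Camera-frame ray through pixel (u, v), scaled so that z = 1."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)

def gaussian_center(u, v, depth, delta, intrinsics):
    """Centre of a pixel-aligned Gaussian: the sensor depth plus a learnable
    offset `delta`, applied along the pixel's ray only."""
    ray = pixel_ray(u, v, *intrinsics)
    z = depth + delta
    return tuple(z * r for r in ray)

intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (toy values)

# The same pixel at two adjusted depths: both centres lie on one ray.
near = gaussian_center(820.0, 240.0, depth=2.0, delta=0.0, intrinsics=intrinsics)
far = gaussian_center(820.0, 240.0, depth=2.0, delta=0.5, intrinsics=intrinsics)
print(near, far)
```

Because only `delta` is free, each Gaussian has one positional degree of freedom rather than three, which sits between a fully free 3D Gaussian and one fixed at the measured depth.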

174. 【2603.21048】A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors

Link: https://arxiv.org/abs/2603.21048

Authors: Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho, Khanh-Thanh-Khoa Nguyen, Thanh-Hai Le

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: hazardous driving behaviors, in-cabin video streams, enhancing road safety, unsafe driver actions, temporal action localization

Comments: 25 pages, 14 figures

Abstract:The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers stronger representations with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.

175. 【2603.21047】When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound

Link: https://arxiv.org/abs/2603.21047

Authors: Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam, Moein Heidari, Mojan Izadkhah, Zahra Kavian, Giuseppe Carenini, Lele Wang, Dena Shahriari, Ilker Hacihaliloglu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time imaging capabilities, clinical practice due, practice due, real-time imaging, imaging capabilities

Abstract:Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.

176. 【2603.21046】SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments

Link: https://arxiv.org/abs/2603.21046

Authors: Wen Jiang, Kangyao Huang, Li Wang, Wang Xu, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hanfang Liang, Hongwei Duan, Bin Xu, Xiangyang Ji

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: UAV VLN, disaster response, autonomous exploration, infrastructure inspection, play an important

Abstract:UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV VLN in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing NE by 4.03m and improving SR by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.

177. 【2603.21045】LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction

Link: https://arxiv.org/abs/2603.21045

Authors: Shuwei Huang, Shizhuo Liu, Zijun Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Diffusion-based image super-resolution, Diffusion-based image, image super-resolution, reconstruct high-resolution, faces a fundamental

Abstract:Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework's core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors. The source code of our method can be found at this https URL.

178. 【2603.21010】SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

Link: https://arxiv.org/abs/2603.21010

Authors: Zhixiang Lu, Shijie Xu, Kaicheng Yan, Xuyue Cai, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Jionglong Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extreme data scarcity, high computational costs, computational costs, extreme data, data scarcity

Comments: Accepted by 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

Abstract:The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.

179. 【2603.20999】OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

Link: https://arxiv.org/abs/2603.20999

Authors: Aizierjiang Aiersilan, Zhangfei Yang

Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: faces dual challenges, volatile wireless channels, Deep Reinforcement Learning, uncertain gaze patterns, teleoperation faces dual

Abstract:Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.

180. 【2603.20985】Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

Link: https://arxiv.org/abs/2603.20985

Authors: Binesh Sadanandan, Vahid Behzadan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: yield identical predictions, equivalent prompts yield, prompts yield identical, semantically equivalent prompts, deploying medical vision-language

Comments: CVPR 2026 Workshop on Medical Reasoning with Vision Language Foundation Models

Abstract:Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
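The four-quadrant taxonomy described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and "image-reliant" is assumed here to mean that the majority prediction with the image differs from the text-only prediction.

```python
from collections import Counter


def classify_sample(paraphrase_preds, text_only_pred):
    """Place one sample in the consistency x image-reliance grid.

    paraphrase_preds: predictions made WITH the image, one per paraphrased prompt.
    text_only_pred:   prediction for the same question with the image removed.
    """
    # Consistent: every paraphrase yields the same prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Assumption: compare the majority with-image prediction to the text-only one.
    majority = Counter(paraphrase_preds).most_common(1)[0][0]
    image_reliant = majority != text_only_pred

    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"


def quadrant_fractions(samples):
    """samples: iterable of (paraphrase_preds, text_only_pred) pairs."""
    counts = Counter(classify_sample(p, t) for p, t in samples)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("Ideal", "Fragile", "Dangerous", "Worst")}
```

The text-only prediction costs one extra forward pass per sample, which is what makes the "Dangerous" quadrant visible at all: a consistency check alone cannot separate it from "Ideal".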

181. 【2603.20970】GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies

Link: https://arxiv.org/abs/2603.20970

Authors: Uzair Shah, Marco Agus, Mahmoud Gamal, Mahmood Alzubaidi, Corrado Cali, Pierre J. Magistretti, Abdesselam Bouzerdoum, Mowafa Househ

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neuronal morphology encodes, morphology encodes critical, encodes critical information, current methods analyze, methods analyze topology

Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract:Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: this https URL
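For reference, the symmetric InfoNCE objective named in the abstract can be written out in plain Python. This is a generic CLIP-style sketch rather than the paper's implementation: it assumes a precomputed similarity matrix with matched image/graph pairs on the diagonal, and the function name and the temperature default of 0.07 are illustrative choices.

```python
import math


def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(sim, temperature=0.07):
    """sim[i][j]: similarity between image embedding i and graph embedding j.

    Matched pairs are assumed to sit on the diagonal (i == j).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: graph-to-image direction
    loss_img_to_graph = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    loss_graph_to_img = -sum(_log_softmax(logits_t[i])[i] for i in range(n)) / n
    # Symmetric loss: average of the two retrieval directions.
    return 0.5 * (loss_img_to_graph + loss_graph_to_img)
```

When all similarities are equal, the loss reduces to log(n), the chance-level value for a batch of n pairs. Well-aligned embeddings push it toward zero.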

Submission history: from Mahmood Saleh Alzubaidi, [v1] Sat, 21 Mar 2026 22:47:24 UTC (22,953 KB)

182. 【2603.20898】Natural Gradient Descent for Online Continual Learning

Link: https://arxiv.org/abs/2603.20898

Authors: Joe Khawand, David Colliaux

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Online Continual Learning, Continual Learning, image classification represents, assuming data independence, Online Continual

Comments: 13 pages, 2 figures

Abstract:Online Continual Learning (OCL) for image classification represents a challenging subset of Continual Learning, focusing on classifying images from a stream without assuming that the data are independent and identically distributed (i.i.d.). The primary challenge in this context is to prevent catastrophic forgetting, where the model's performance on previous tasks deteriorates as it learns new ones. Although various strategies have been proposed to address this issue, achieving rapid convergence remains a significant challenge in the online setting. In this work, we introduce a novel approach to training OCL models that utilizes the Natural Gradient Descent optimizer, incorporating an approximation of the Fisher Information Matrix (FIM) through Kronecker Factored Approximate Curvature (KFAC). This method demonstrates substantial improvements in performance across all OCL methods, particularly when combined with existing OCL tricks, on datasets such as Split CIFAR-100, CORE50, and Split miniImageNet.
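For readers unfamiliar with the optimizer named in the abstract, the standard textbook form of natural gradient descent with a Kronecker-factored Fisher approximation is sketched below. This is the generic formulation and not necessarily the exact variant used in the paper. The symbols are notation introduced here: $\eta$ is the learning rate, $\lambda$ a damping constant, $a$ the input activations of a layer with weight matrix $W$, and $g$ the gradient with respect to that layer's pre-activations.

```latex
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta\, F^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F = \mathbb{E}\!\left[\nabla_\theta \log p(y \mid x; \theta)\,
                      \nabla_\theta \log p(y \mid x; \theta)^{\top}\right] \\
F_W &\approx A \otimes G,
\qquad
A = \mathbb{E}\!\left[a a^{\top}\right],
\quad
G = \mathbb{E}\!\left[g g^{\top}\right] \\
\Delta W &= -\eta\, (G + \lambda I)^{-1}\, \nabla_W \mathcal{L}\, (A + \lambda I)^{-1}
\end{aligned}
```

The Kronecker structure is what makes the update affordable: instead of inverting one matrix whose side is the number of weights in the layer, only the two much smaller factors $A$ and $G$ are inverted.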

183. 【2603.20887】Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning

Link: https://arxiv.org/abs/2603.20887

Authors: Xu Zhang, Jin Yuan, BinHong Yang, Xuan Liu, Qianjun Zhang, Yuyi Wang, Zhiyong Li, Hanwang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, video multimodal interpretation, enhances users' understanding, diverse modalities, Controllable Video Segmentation

Comments: 12 pages, 6 figures

Abstract:Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design an innovative framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user's requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder to collaboratively predict high-quality caption-mask pairs using a Multi-entity Contrastive loss, and to provide fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users' comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at this https URL.

184. 【2603.20868】TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising

Link: https://arxiv.org/abs/2603.20868

Authors: Tangtangfang Fang, Yang Jiao, Xiangjian He, Jingxi Hu, Jiaqi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-dose computed tomography, reduces radiation exposure, erasing subtle anatomical, Low-dose computed, subtle anatomical details

Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but also introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details. In this paper, we present TAFG-MAN, a latent diffusion framework for efficient and high-quality LDCT image denoising. The framework combines a perceptually optimized autoencoder, conditional latent diffusion restoration in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning design. TAFG decomposes condition features into low- and high-frequency components, predicts timestep-adaptive gates from the current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages before cross-attention. In this way, the model relies more on stable structural guidance at early reverse steps and introduces fine details more cautiously as denoising proceeds, improving the balance between noise suppression and detail preservation. Experiments show that TAFG-MAN achieves a favorable quality-efficiency trade-off against representative baselines. Compared with its base variant without TAFG, it further improves detail preservation and perceptual quality while maintaining essentially the same inference cost, and ablation results confirm the effectiveness of the proposed conditioning mechanism.

185. 【2603.20860】Restoring Neural Network Plasticity for Faster Transfer Learning

Link: https://arxiv.org/abs/2603.20860

Authors: Xander Coetzer, Arné Schreuder, Anna Sergeevna Bosman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Transfer learning, standard practice, practice in computer, downstream task, pretrained weights

Comments: 11 pages, 1 figure, 6 tables and 2 formulas

Abstract:Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.

186. 【2603.20857】Fast and Robust Deformable 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.20857

Authors: Han Jiao, Jiakai Sun, Lei Zhao, Zhanjie Zhang, Wei Xing, Huaizhong Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: demonstrated remarkable real-time, Gaussian Splatting, Splatting has demonstrated, remarkable real-time rendering, real-time rendering capabilities

Abstract:3D Gaussian Splatting has demonstrated remarkable real-time rendering capabilities and superior visual quality in novel view synthesis for static scenes. Building upon these advantages, researchers have progressively extended 3D Gaussians to dynamic scene reconstruction. Deformation field-based methods have emerged as a promising approach among various techniques. These methods maintain 3D Gaussian attributes in a canonical field and employ the deformation field to transform this field across temporal sequences. Nevertheless, these approaches frequently encounter challenges such as suboptimal rendering speeds, significant dependence on initial point clouds, and vulnerability to local optima in dim scenes. To overcome these limitations, we present FRoG, an efficient and robust framework for high-quality dynamic scene reconstruction. FRoG integrates per-Gaussian embedding with a coarse-to-fine temporal embedding strategy, accelerating rendering through the early fusion of temporal embeddings. Moreover, to enhance robustness against sparse initializations, we introduce a novel depth- and error-guided sampling strategy. This strategy populates the canonical field with new 3D Gaussians at low-deviation initial positions, significantly reducing the optimization burden on the deformation field and improving detail reconstruction in both static and dynamic regions. Furthermore, by modulating opacity variations, we mitigate the local optima problem in dim scenes, improving color fidelity. Comprehensive experimental results validate that our method achieves accelerated rendering speeds while maintaining state-of-the-art visual quality.

187. 【2603.20856】Ensemble of Small Classifiers For Imbalanced White Blood Cell Classification

Link: https://arxiv.org/abs/2603.20856

Authors: Siddharth Srivastava, Adam Smith, Scott Brooks, Jack Bacon, Till Bretschneider

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Automating white blood, white blood cell, Automating white, blood cell classification, expert pathologists

Comments: Accepted at ISBI 2026 WBCBench Challenge

Abstract:Automating white blood cell classification for diagnosis of leukaemia is a promising alternative to time-consuming and resource-intensive examination of cells by expert pathologists. However, designing robust algorithms for classification of rare cell types remains challenging due to variations in staining, scanning and inter-patient heterogeneity. We propose a lightweight ensemble approach for classification of cells during Haematopoiesis, with a focus on the biology of Granulopoiesis, Monocytopoiesis and Lymphopoiesis. Through dataset expansion to alleviate some class imbalance, we demonstrate that a simple ensemble of lightweight pretrained SwinV2-Tiny, DinoBloom-Small and ConvNeXT-V2-Tiny models achieves excellent performance on this challenging dataset. We train 3 instantiations of each architecture in a stratified 3-fold cross-validation framework; for an input image, we forward-pass through all 9 models and aggregate through logit averaging. We further reason on the weaknesses of our model in confusing similar-looking myelocytes in granulopoiesis and lymphocytes in lymphopoiesis. Code: this https URL.
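The aggregation step described in the abstract (forward-pass through all 9 models, then average the logits) can be illustrated with a small dependency-free sketch. The function names and the toy logits are hypothetical, and the model forward passes themselves are assumed to have been run elsewhere.

```python
def average_logits(logits_per_model):
    """logits_per_model: one list of per-class logits per ensemble member."""
    n_models = len(logits_per_model)
    n_classes = len(logits_per_model[0])
    return [
        sum(model_logits[c] for model_logits in logits_per_model) / n_models
        for c in range(n_classes)
    ]


def ensemble_predict(logits_per_model):
    """Return the class index with the highest averaged logit."""
    mean = average_logits(logits_per_model)
    return max(range(len(mean)), key=mean.__getitem__)


# Toy example with 3 members and 3 classes (values are made up).
toy_logits = [
    [2.0, 1.0, 0.0],  # this member alone would pick class 0
    [0.0, 3.0, 0.0],
    [0.0, 2.5, 0.5],
]
predicted_class = ensemble_predict(toy_logits)
```

Averaging logits rather than hard votes lets a confident member outweigh an uncertain one, which is one common reason to prefer it over majority voting in small ensembles.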

188. 【2603.20850】Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Link: https://arxiv.org/abs/2603.20850

Authors: Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Understanding hand-object interaction, Understanding hand-object, computer vision, fundamental to computer, Understanding

Comments: CVPR 2026

Abstract:Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

189. 【2603.20848】GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit

Link: https://arxiv.org/abs/2603.20848

Authors: Chad Vanderbilt, Gabriele Campanella, Siddharth Singi, Swaraj Nanda, Jie-Fu Chen, Ali Kamali, Amir Momeni Boroujeni, David Kim, Mohamed Yakoub, Jamal Benhamida, Meera Hameed, Neeraj Kumar, Gregory Goldgof

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Tissues and Organs (q-bio.TO)

Keywords: predict therapeutic response, histopathology-derived patterns extracted, whole-slide images, extracted from hematoxylin-eosin, artificial intelligence

Abstract:Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin-eosin (HE) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. We introduce GOLDMARK (this https URL), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing. Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic-genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability. GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.

Submission history: from Chad Vanderbilt, [v1] Sat, 21 Mar 2026 15:09:06 UTC (7,210 KB)

190. 【2603.20839】Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking

Link: https://arxiv.org/abs/2603.20839

Authors: Yujin Park, Haejun Chung, Ikbeom Jang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: require quadratic cost, Pairwise comparison labeling, conventional classification labeling, yields higher inter-rater, quadratic cost

Comments: 12 pages, 2 figures, Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2026)

Abstract:Pairwise comparison labeling is emerging because it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic-aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11-16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. In FG-NET with ground-truth ages, the framework extracts 5-20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy-efficiency trade-offs.
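Of the ensemble members the abstract lists, Elo is the simplest to write down. The sketch below shows the classical Elo update after a single pairwise judgment. It illustrates how one human comparison moves two item ratings and is not Dodgersort's own code. The function names, the K-factor of 32, and the 400-point scale are the conventional chess defaults, assumed here.

```python
def elo_expected(r_a, r_b):
    """Probability that item A is ranked above item B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, outcome_a, k=32.0):
    """Update both ratings after one comparison.

    outcome_a: 1.0 if the annotator preferred A, 0.0 if B, 0.5 for a tie.
    """
    expected_a = elo_expected(r_a, r_b)
    delta = k * (outcome_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return r_a + delta, r_b - delta


# Two equally rated items; the annotator prefers A.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)  # -> (1516.0, 1484.0)
```

A comparison whose outcome the model already predicts with high confidence moves the ratings very little. That is the intuition behind selecting informative pairs instead of comparing exhaustively.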

191. 【2603.20836】MERIT: Multi-domain Efficient RAW Image Translation

Link: https://arxiv.org/abs/2603.20836

Authors: Wenjun Huang, Shenghao Fu, Yian Jin, Yang Ni, Ziteng Cui, Hanning Chen, Yirui He, Yezi Liu, Sanggeon Yun, SungHeon Jeong, Ryozo Masukawa, William Youngwoo Chung, Mohsen Imani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: varying spectral responses, computer vision tasks, downstream computer vision, RAW images captured, sensors exhibit substantial

Abstract:RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).

192. 【2603.20828】EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis

Link: https://arxiv.org/abs/2603.20828

Authors: Xiefan Guo, Xinzhu Ma, Haoxiang Ma, Zihao Zhou, Di Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable fidelity, require deep-level world, processing implicit prompts, deep-level world knowledge, explicit text prompts

Abstract:Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring the effectiveness and generalizability. Our code is available at this https URL.

193. 【2603.20818】PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching

Link: https://arxiv.org/abs/2603.20818

Authors: Hanqiao Ye, Yuzhou Liu, Yangdong Liu, Shuhan Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: regressing query-map associations, planar primitives, query-map associations, structured environments, structure-based relocalizers

Comments: Accepted by CVPR 2026. 20 pages, 15 figures. Code at https://github.com/3dv-casia/PlanaReLoc

Abstract:While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane-centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data are available at this https URL.

194. 【2603.20811】Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation

Link: https://arxiv.org/abs/2603.20811

Authors: Chenxing Meng, Wuzhou Quan, Yingjie Cai, Liqun Cao, Liyan Zhang, Mingqiang Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic Aperture Radar, occlusion severely degrades, incorporating Synthetic Aperture, remote sensing imagery, severely degrades

Comments: 14 pages, 7 figures

Abstract:Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98×. Our implementation is available at this https URL.

195. 【2603.20808】Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.20808

Authors: Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Language Models, Multimodal Large Language, Large Language, competence remains unclear, foundational competence remains

Comments: Accepted at CVPR 2026

Abstract:While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.

196. 【2603.20806】Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification

Link: https://arxiv.org/abs/2603.20806

Authors: Yifeng Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-label fundus diagnosis, large-scale retinal structure, Octave Convolution increased, Multi-label fundus, fine-grained lesions

Comments: 29 pages, 3 figures, 8 tables

Abstract:Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation by 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.

197. 【2603.20804】Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation

Link: https://arxiv.org/abs/2603.20804

Authors: Qunchao Jin, Yiliao Song, Qi Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: systems are fundamentally, partial observability, personally visited, fundamentally constrained, constrained by partial

Abstract:Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other's observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent's receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that the vision-sharing-enabled model yields substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.

198. 【2603.20785】ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking

Link: https://arxiv.org/abs/2603.20785

Authors: Kanglong Fan, Tianhe Wu, Wen Wen, Jianzhao Liu, Le Yang, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advance image quality, image quality assessment, Reasoning-induced vision-language models, so-called discrete collapse, advance image

Abstract:Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuse this ordinal evidence with the initial score under Thurstone's Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
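
The fusion step in (ii) can be illustrated with a minimal sketch of Thurstone's Case V model, in which the probability that one image is preferred over another is the standard normal CDF of their score difference divided by sigma times the square root of two. The function names, the `weight` blending factor, and the simple averaging rule below are illustrative assumptions and are not taken from the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def preference_probability(score_a, score_b, sigma=1.0):
    """Thurstone Case V: P(a preferred over b) = Phi((a - b) / (sigma * sqrt(2)))."""
    return _STD_NORMAL.cdf((score_a - score_b) / (sigma * 2 ** 0.5))


def fuse_score(initial_score, neighbors, sigma=1.0, weight=0.5):
    """Blend an initial score with pairwise evidence from retrieved neighbors.

    neighbors: list of (neighbor_score, p) pairs, where p is the probability
    that the query image is preferred over that neighbor.
    weight: hypothetical blending factor between the two evidence sources.
    """
    if not neighbors:
        return initial_score
    implied = []
    for neighbor_score, p in neighbors:
        p = min(max(p, 1e-6), 1 - 1e-6)  # keep the inverse CDF finite
        # Invert the Case V relation to get the score implied by this comparison.
        implied.append(neighbor_score + sigma * 2 ** 0.5 * _STD_NORMAL.inv_cdf(p))
    pairwise_estimate = sum(implied) / len(implied)
    return (1 - weight) * initial_score + weight * pairwise_estimate
```

Because each comparison yields a continuous implied score, the fused value can land between the few discrete values an initial scalar score tends to collapse to.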

199. 【2603.20782】MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction

Link: https://arxiv.org/abs/2603.20782

Authors: Jiaxin Cheng, Yue Wu, Yicong Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning-based edge detection, detection models trained, single-pixel annotations typically, Learning-based edge, annotations typically provided

Comments: Accepted at CVPR 2026

Abstract:Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
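
The progressive prediction idea can be sketched on a toy 1D signal. The sketch assumes a hypothetical `predict_fn` that returns per-pixel edge probabilities conditioned on the pixels already finalized; the toy predictor, the equal-sized rounds, and the 0.5 threshold are illustrative choices and not the paper's actual procedure.

```python
def progressive_decode(predict_fn, num_pixels, steps=4, threshold=0.5):
    """Finalize pixels over several rounds, most confident first.

    predict_fn(decided) -> list of edge probabilities, one per pixel, where
    `decided` maps already-finalized pixel indices to 0/1 labels.
    """
    decided = {}
    for step in range(steps):
        remaining = [i for i in range(num_pixels) if i not in decided]
        if not remaining:
            break
        probs = predict_fn(decided)
        # Confidence is the distance of the probability from the threshold.
        remaining.sort(key=lambda i: abs(probs[i] - threshold), reverse=True)
        rounds_left = steps - step
        quota = -(-len(remaining) // rounds_left)  # ceiling division
        for i in remaining[:quota]:
            decided[i] = 1 if probs[i] >= threshold else 0
    return [decided[i] for i in range(num_pixels)]


def toy_predict(decided):
    """Hypothetical predictor: a thick edge whose confidence peaks at the center.

    Once a pixel is finalized as an edge, belief in its undecided neighbors halves.
    """
    probs = [0.02, 0.6, 0.9, 0.6, 0.02]
    for i, label in decided.items():
        if label == 1:
            for j in (i - 1, i + 1):
                if 0 <= j < len(probs) and j not in decided:
                    probs[j] *= 0.5
    return probs
```

With `steps=1` every pixel is thresholded in a single pass and the toy edge comes out three pixels wide; with `steps=5` the center is committed first and its neighbors are then rejected, leaving a single-pixel edge.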

200. 【2603.20778】PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

Link: https://arxiv.org/abs/2603.20778

Authors: Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tackles UAV-based ego, unified framework, framework that tackles, tackles UAV-based, UAV-based ego

Abstract:We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: this https URL.

201. 【2603.20777】OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation

Link: https://arxiv.org/abs/2603.20777

Authors: Aarush Aggarwal, Akshat Tomar, Amritanshu Tiwari, Sargam Goyal

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robust semantic segmentation, safe autonomous driving, Robust semantic, deployed models remain, models remain vulnerable

Comments: 10 pages, 4 figures, ICLR 2026: Principled Design for Trustworthy AI

Abstract:Robust semantic segmentation is crucial for safe autonomous driving, yet deployed models remain vulnerable to black-box adversarial attacks when target weights are unknown. Most existing approaches either craft image-wide perturbations or optimize patches for a single architecture, which limits their practicality and transferability. We introduce OmniPatch, a training framework for learning a universal adversarial patch that generalizes across images and both ViT and CNN architectures without requiring access to target model parameters.

202. 【2603.20755】Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping

Link: https://arxiv.org/abs/2603.20755

Authors: Sunghyun Park, Jeongho Kim, Hyoungwoo Park, Debasmit Das, Sungrack Yun, Munawar Hayat, Jaegul Choo, Fatih Porikli, Seokeon Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: personalized content creation, enabling high-quality personalized, high-quality personalized content, generation quality, enabling high-quality

Comments: Accepted to CVPR 2026; 20 pages

Abstract:Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computation and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward-backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
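
The timestep-aware patch sampling described above can be sketched as a crop-size schedule. The linear schedule, the default sizes, and the function names are assumptions made for illustration; the abstract only states that crops are larger at higher timesteps and are then resized to a fixed lower resolution.

```python
import random


def crop_size_for_timestep(t, num_timesteps=1000, image_size=1024, min_crop=256):
    """Map a diffusion timestep to a square crop size (hypothetical linear schedule).

    High (noisy) timesteps get large crops that keep global structure;
    low timesteps get small crops that keep fine detail.
    """
    fraction = t / (num_timesteps - 1)
    return round(min_crop + fraction * (image_size - min_crop))


def sample_patch_box(t, rng, num_timesteps=1000, image_size=1024, min_crop=256):
    """Pick a random crop box (left, top, right, bottom) for timestep t."""
    size = crop_size_for_timestep(t, num_timesteps, image_size, min_crop)
    left = rng.randint(0, image_size - size)
    top = rng.randint(0, image_size - size)
    # The crop would then be resized to one fixed training resolution
    # (for example min_crop x min_crop) with an image library, omitted here.
    return (left, top, left + size, top + size)
```

Since every crop is resized to the same small resolution, activation memory per step stays roughly constant regardless of how much of the image the crop covers.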

203. 【2603.20752】Smart Operation Theatre: An AI-based System for Surgical Gauze Counting

Link: https://arxiv.org/abs/2603.20752

Authors: Saraf Krish, Cai Yiyu, Huang Li Hui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inside patients' bodies, left inside patients', patients' bodies, regulatory penalties, lead to legal

Abstract:During surgeries, there is a risk of medical gauzes being left inside patients' bodies, a condition known as "gossypiboma" that can cause serious complications for patients and expose hospitals to malpractice lawsuits and regulatory penalties. Diagnosis depends on imaging methods such as X-rays or CT scans, and the usual treatment involves surgical excision. Prevention methods, such as manual counts and RFID-integrated gauzes, aim to minimize gossypiboma risks. However, manual tallying of hundreds of gauzes by nurses is time-consuming and diverts resources from patient care. In partnership with Singapore General Hospital (SGH), we have developed a new prevention method: an AI-based system for gauze counting in surgical settings. Utilizing real-time video surveillance and object recognition technology powered by YOLOv5, a deep learning model was designed to monitor gauzes on two designated trays labelled "In" and "Out". Gauzes are tracked from the "In" tray, prior to their use in the patient's body, to the "Out" tray post-use, ensuring accurate counting and verifying that no gauze remains inside the patient at the end of the surgery. We trained the model on numerous images from operation theatres, augmented to cover all possible scenarios. This study has also addressed the shortcomings of previous project iterations. Previously, the project employed two models, one for human detection and another for gauze detection, trained on a total of 2800 images. Now we have an integrated model capable of identifying both humans and gauzes, using a training set of 11,000 images. This has led to improvements in accuracy and increased the frame rate from 8 FPS to 15 FPS. Incorporating doctor's feedback, the system now also supports manual count adjustments, enhancing its reliability in actual surgeries.
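
The "In"/"Out" tray bookkeeping can be sketched as follows. The sketch assumes gauze detections arrive as `(x1, y1, x2, y2)` boxes from some detector and that each tray is a fixed rectangle in the camera view; the function names and the reconciliation formula are illustrative assumptions, not the system's actual code.

```python
def tray_for_box(box, trays):
    """Return the name of the tray that contains the box center, or None."""
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    for name, (x1, y1, x2, y2) in trays.items():
        if x1 <= center_x <= x2 and y1 <= center_y <= y2:
            return name
    return None


def count_per_tray(gauze_boxes, trays):
    """Count detected gauzes lying on each tray; boxes outside all trays are ignored."""
    counts = {name: 0 for name in trays}
    for box in gauze_boxes:
        name = tray_for_box(box, trays)
        if name is not None:
            counts[name] += 1
    return counts


def unaccounted_gauzes(initial_in, counts, manual_adjustment=0):
    """Gauzes that left the "In" tray but have not reached the "Out" tray.

    manual_adjustment models a staff correction to the "Out" count.
    """
    used = initial_in - counts["In"]
    returned = counts["Out"] + manual_adjustment
    return used - returned
```

A result of zero at the end of surgery means every gauze taken from the "In" tray has been seen on the "Out" tray or accounted for manually.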

204. 【2603.20741】CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

Link: https://arxiv.org/abs/2603.20741

Authors: Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generated images remains, Recent advancements, achieving precise alignment, persistent challenge, largely propelled

Comments: Accepted by CVPR 2026

Abstract:Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at this https URL.

205. 【2603.20739】Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding

Link: https://arxiv.org/abs/2603.20739

Authors: Jincen Jiang, Qianyu Zhou, Yuhang Li, Kui Su, Meili Wang, Jian Chang, Jian Jun Zhang, Xuequan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced point cloud, point cloud representation, cloud representation learning, single-domain settings, architectures have advanced

Comments: Accepted to CVPR 2026

Abstract:While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.

206. 【2603.20738】SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval

Link: https://arxiv.org/abs/2603.20738

Authors: Qunjie Huang, Weina Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: distort similarity geometry, destabilize top-k rankings, Similarity Local Scaling, Cross-domain Similarity Local, making small-k shortlists

Comments: Accepted to CVPR 2026. Official code: [this https URL](https://github.com/QunjieHuang/SATTC-CVPR2026)

Abstract:Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
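
For readers unfamiliar with CSLS, the standard (non-adaptive) form rescales each similarity as `2 * sim(x, y) - r(x) - r(y)`, where `r` is the mean similarity of an item to its k nearest neighbors on the other side. The sketch below applies that standard rule to a plain similarity matrix; it does not implement SATTC's adaptive variant or its whitening step, and the function names are hypothetical.

```python
def top_k_mean(values, k):
    """Mean of the k largest values (or of all values if there are fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls_from_similarity(sim, k=10):
    """Apply standard CSLS to a query-by-candidate similarity matrix.

    sim[i][j] is the cosine similarity between query i and candidate j.
    Candidates that are close to many queries (hubs) get a large penalty.
    """
    num_queries = len(sim)
    num_candidates = len(sim[0])
    query_density = [top_k_mean(row, k) for row in sim]
    candidate_density = [
        top_k_mean([sim[i][j] for i in range(num_queries)], k)
        for j in range(num_candidates)
    ]
    return [
        [2 * sim[i][j] - query_density[i] - candidate_density[j]
         for j in range(num_candidates)]
        for i in range(num_queries)
    ]
```

In a small case where one candidate scores 0.9 against every query, raw cosine ranks it first for all of them, while CSLS demotes it for the one query that has a nearly as good, less popular alternative.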

207. 【2603.20731】VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

Link: https://arxiv.org/abs/2603.20731

Authors: Jun Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing multi-object tracking, CLIP Image Encoder, visual semantic information, algorithms typically fail, Existing multi-object

Abstract:Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.

208. 【2603.20729】Weakly supervised multimodal segmentation of acoustic borehole images with depth-aware cross-attention

Link: https://arxiv.org/abs/2603.20729

Authors: Jose Luis Lima de Jesus Silva

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)

Keywords: Acoustic borehole images, high-resolution borehole-wall structure, large-scale interpretation remains, interpretation remains difficult, dense expert annotations

Abstract:Acoustic borehole images provide high-resolution borehole-wall structure, but large-scale interpretation remains difficult because dense expert annotations are rarely available and subsurface information is intrinsically multimodal. The challenge is developing weakly supervised methods combining two-dimensional image texture with depth-aligned one-dimensional well-logs. Here, we introduce a weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. This preserves the annotation-free character of classical thresholding and clustering workflows while extending them with denoising, confidence-aware pseudo-supervision, and physically structured fusion. We establish that threshold-guided learned refinement provides the most robust improvement over raw thresholding, denoised thresholding, and latent clustering baselines. Multimodal performance depends strongly on fusion strategy: direct concatenation provides limited gains, whereas depth-aware cross-attention, gated fusion, and confidence-aware modulation substantially improve agreement with the weak supervisory reference. The strongest model, confidence-gated depth-aware cross-attention (CG-DCA), consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Targeted ablations show its advantage depends specifically on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm this performance is broadly stable. These results establish a practical, scalable framework for annotation-free segmentation, showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and depth-aware.

209. 【2603.20725】Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

Link: https://arxiv.org/abs/2603.20725

Authors: Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced rapidly, struggles to capture, capture the nuanced, preference, nuanced user preferences

Abstract:Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.

210. 【2603.20721】Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

Link: https://arxiv.org/abs/2603.20721

Authors: Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Text-aerial person retrieval, supporting intelligent transportation, public security applications, Text-aerial person, person retrieval aims

Comments: Accepted by CVPR 2026 main track

Abstract:Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text-aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text-aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.

211. 【2603.20714】The Role and Relationship of Initialization and Densification in 3D Gaussian Splatting

Link: https://arxiv.org/abs/2603.20714

Authors: Ivan Desiatov, Torsten Sattler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, choice for photo-realistic, appearance and geometry, Splatting, scene appearance

Comments: Sources will be made publicly available

Abstract:3D Gaussian Splatting (3DGS) has become the method of choice for photo-realistic 3D reconstruction of scenes because it can efficiently and accurately recover the scene appearance and geometry from images. 3DGS represents the scene through a set of 3D Gaussians, parameterized by their position, spatial extent, and view-dependent color. Starting from an initial point cloud, 3DGS refines the Gaussians' parameters so as to reconstruct a set of training images as accurately as possible. Typically, a sparse Structure-from-Motion point cloud is used as initialization. In order to obtain dense Gaussian clouds, 3DGS methods thus rely on a densification stage. In this paper, we systematically study the relation between densification and initialization. Proposing a new benchmark, we study combinations of different types of initializations (dense laser scans, dense (multi-view) stereo point clouds, dense monocular depth estimates, sparse SfM point clouds) and different densification schemes. We show that current densification approaches are not able to take full advantage of dense initialization as they are often unable to (significantly) improve over sparse SfM-based initialization. We will make our benchmark publicly available.

212. 【2603.20708】High-Quality and Efficient Turbulence Mitigation with Events

Link: https://arxiv.org/abs/2603.20708

Authors: Xiaoran Zhang, Jian Ding, Yuxing Duan, Haoyue Liu, Gang Chen, Yi Chang, Luxin Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly ill-posed due, highly ill-posed, ill-posed due, stochastic nature, Turbulence mitigation

Comments: Accepted by CVPR 2026

Abstract:Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively. Our code is available at: this https URL.

213. 【2603.20698】Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Link: https://arxiv.org/abs/2603.20698

Authors: Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated remarkable potential

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

214. 【2603.20697】Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

Link: https://arxiv.org/abs/2603.20697

Authors: Yifan Yang, Lei Zou, Wendy Jepson

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: rapid situational awareness, rapid situational, aftermath of natural, situational awareness, Abstract

Comments: Accepted for presentation at IGARSS 2026 (IEEE International Geoscience and Remote Sensing Symposium)

Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism-fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.

215. 【2603.20690】MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution

Link: https://arxiv.org/abs/2603.20690

Authors: Ruiqing Wang, Kai Zhang, Yuanzhi Zhu, Hanshu Yan, Shilin Lu, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced Real-world Image, sampling makes inference, makes inference slow, Real-world Image Super-Resolution, multi-step sampling makes

Abstract:Diffusion- and flow-based models have advanced Real-world Image Super-Resolution (Real-ISR), but their multi-step sampling makes inference slow and hard to deploy. One-step distillation alleviates the cost, yet often degrades restoration quality and removes the option to refine with more steps. We present Mean Flows for Super-Resolution (MFSR), a new distillation framework that produces photorealistic results in a single step while still allowing an optional few-step path for further improvement. Our approach uses MeanFlow as the learning target, enabling the student to approximate the average velocity between arbitrary states of the Probability Flow ODE (PF-ODE) and effectively capture the teacher's dynamics without explicit rollouts. To better leverage pretrained generative priors, we additionally improve original MeanFlow's Classifier-Free Guidance (CFG) formulation with teacher CFG distillation strategy, which enhances restoration capability and preserves fine details. Experiments on both synthetic and real-world benchmarks demonstrate that MFSR achieves efficient, flexible, and high-quality super-resolution, delivering results on par with or even better than multi-step teachers while requiring much lower computational cost.

216. 【2603.20682】IBCapsNet: Information Bottleneck Capsule Network for Noise-Robust Representation Learning

Link: https://arxiv.org/abs/2603.20682

Authors: Canqun Xiang, Chen Yang, Jiaoyan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational cost, modeling hierarchical spatial, hierarchical spatial relationships, computational cost due, critical limitations

Abstract:Capsule networks (CapsNets) are superior at modeling hierarchical spatial relationships but suffer from two critical limitations: high computational cost due to iterative dynamic routing and poor robustness under input corruptions. To address these issues, we propose IBCapsNet, a novel capsule architecture grounded in the Information Bottleneck (IB) principle. Instead of iterative routing, IBCapsNet employs a one-pass variational aggregation mechanism, where primary capsules are first compressed into a global context representation and then processed by class-specific variational autoencoders (VAEs) to infer latent capsules regularized by the KL divergence. This design enables efficient inference while inherently filtering out noise. Experiments on MNIST, Fashion-MNIST, SVHN and CIFAR-10 show that IBCapsNet matches CapsNet in clean-data accuracy (achieving 99.41% on MNIST and 92.01% on SVHN), yet significantly outperforms it under four types of synthetic noise - demonstrating average improvements of +17.10% and +14.54% for clamped additive and multiplicative noise, respectively. Moreover, IBCapsNet achieves 2.54x faster training and 3.64x higher inference throughput compared to CapsNet, while reducing model parameters by 4.66%. Our work bridges information-theoretic representation learning with capsule networks, offering a principled path toward robust, efficient, and interpretable deep models. Code is available at this https URL
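The KL regularizer mentioned in the abstract can be illustrated with the closed form commonly used in VAEs. This is only a sketch under an assumption the abstract does not confirm: a diagonal Gaussian posterior measured against a standard normal prior. The function name is hypothetical and is not taken from the paper's code.

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumed form (standard VAE choice, not confirmed by the abstract):
    0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )
```

The term is zero exactly when the posterior equals the prior, so it penalizes latent capsules that carry more information than the task needs.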

217. 【2603.20669】ToFormer: Towards Large-scale Scenario Depth Completion for Lightweight ToF Camera

Link: https://arxiv.org/abs/2603.20669

Authors: Juncheng Chen, Tiancheng Lai, Xingpeng Wang, Bingxin Liao, Baozhe Zhang, Chao Xu, Yanjun Cao

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cameras possess compact, high measurement precision, possess compact design, possess compact, ToF

Comments: 17 pages, 15 figures

Abstract:Time-of-Flight (ToF) cameras possess compact design and high measurement precision to be applied to various robot tasks. However, their limited sensing range restricts deployment in large-scale scenarios. Depth completion has emerged as a potential solution to expand the sensing range of ToF cameras, but existing research lacks dedicated datasets and struggles to generalize to ToF measurements. In this paper, we propose a full-stack framework that enables depth completion in large-scale scenarios for short-range ToF cameras. First, we construct a multi-sensor platform with a reconstruction-based pipeline to collect real-world ToF samples with dense large-scale ground truth, yielding the first LArge-ScalE scenaRio ToF depth completion dataset (LASER-ToF). Second, we propose a sensor-aware depth completion network that incorporates a novel 3D branch with a 3D-2D Joint Propagation Pooling (JPP) module and Multimodal Cross-Covariance Attention (MXCA), enabling effective modeling of long-range relationships and efficient 3D-2D fusion under non-uniform ToF depth sparsity. Moreover, our network can utilize the sparse point cloud from visual SLAM as a supplement to ToF depth to further improve prediction accuracy. Experiments show that our method achieves an 8.6% lower mean absolute error than the second-best method, while maintaining lightweight design to support onboard deployment. Finally, to verify the system's applicability on real robots, we deploy proposed method on a quadrotor at a 10Hz runtime, enabling reliable large-scale mapping and long-range planning in challenging environments for short-range ToF cameras.

218. 【2603.20662】Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning

Link: https://arxiv.org/abs/2603.20662

Authors: Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, James Bailey

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large Vision-Language Models, persistent challenge, spatial reasoning, remarkable advances, advances in large

Abstract: Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.

219. 【2603.20648】A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation

Link: https://arxiv.org/abs/2603.20648

Authors: Ling Xiao, Toshihiko Yamasaki

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: requiring full retraining, fine-grained fashion image, fashion image retrieval, requiring full, dynamic scenarios

Comments: Accepted by IEEE Transactions on Multimedia (TMM), to appear. Preprint version

Abstract:Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available in this https URL.
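The two training ingredients named in the abstract, InfoNCE and EMA distillation, have well-known generic forms. The sketch below shows those generic forms only. The function names, the temperature, and the momentum value are illustrative assumptions, and nothing here reproduces the MCL-FIR implementation.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.1):
    """Generic InfoNCE loss for one anchor.

    sim_pos is the similarity to the positive, sim_negs the similarities
    to the negatives. Returns -log softmax(positive), computed stably.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def ema_update(teacher, student, momentum=0.99):
    """Return teacher weights moved toward the student by an EMA step."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same length")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]
```

A teacher maintained this way changes slowly, which is why EMA copies are a common distillation target when new classes arrive in increments.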

220. 【2603.20644】ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework

Link: https://arxiv.org/abs/2603.20644

Authors: Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: APIs remains challenging, costly proprietary APIs, proprietary APIs remains, Instruction-based image editing, unified multimodal models

Abstract:Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.

221. 【2603.20611】GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

Link: https://arxiv.org/abs/2603.20611

Authors: Di Kong, Yikai Wang, Wenjie Guo, Yifan Bu, Boya Zhang, Yuexin Duan, Xiawei Yue, Wenbiao Du, Yiman Zhong, Yuwen Chen, Cheng Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving internal structure, structure for analysis, widely applied, demands representations, representations that compress

Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

Abstract:Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes, up to 11x faster than NeRF-based approaches, and achieves consistent 16x compression over voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.

222. 【2603.20588】RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction

Link: https://arxiv.org/abs/2603.20588

Authors: Feiran Wang, Zezhou Shang, Gaowen Liu, Yan Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables real-time joint, real-time joint estimation, poses from RGB, reconstruction enables real-time, RGB images

Comments: Project page: https://raymap3r.github.io/

Abstract:Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.

223. 【2603.20584】Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance

Link: https://arxiv.org/abs/2603.20584

Authors: Liangyu Yuan, Yufei Huang, Mingkun Lei, Tong Zhao, Ruoyu Wang, Changxi Chi, Yiwei Wang, Chi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: iterative refinement process, generate synthetic images, Classifier Free Guidance, Diffusion models generate, models generate synthetic

Comments: 22 pages, 12 figures

Abstract:Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of weak-to-strong principle. Based on this, we propose a hybrid instantiation called SGG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SGG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SGG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings. Code is available at this https URL.
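The extrapolation between a main and an inferior signal that the abstract attributes to CFG and AG can be written in one line. The sketch below shows the standard guidance combination, with an illustrative function name and plain lists standing in for model outputs. It does not implement SGG, whose details the abstract does not give.

```python
def guide(main, inferior, scale):
    """Extrapolate from the inferior prediction toward the main one.

    For CFG, `main` is the conditional prediction and `inferior` the
    unconditional one. For AutoGuidance, `inferior` comes from a weaker
    model. scale = 1 returns `main`; scale > 1 extrapolates beyond it.
    """
    if len(main) != len(inferior):
        raise ValueError("main and inferior must have the same length")
    return [i + scale * (m - i) for m, i in zip(main, inferior)]
```

With `scale=0` the result is the inferior signal and with `scale=1` the main signal, so values above 1 push the sample further along the direction in which the two predictions differ.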

224. 【2603.20583】GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories

Link: https://arxiv.org/abs/2603.20583

Authors: Tomasz Frelek, Rohan Patil, Akshar Tumu, Henrik I. Christensen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex urban environments, segmenting feasible vehicle, feasible vehicle trajectories, scalable self-supervised approach, urban environments

Comments: 8 pages, 27 figures, 1 table

Abstract:We present a scalable self-supervised approach for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Leveraging large-scale dashcam videos, we treat recorded ego-vehicle motion as implicit supervision and recover camera trajectories via monocular structure-from-motion, projecting them onto the ground plane to generate spatial masks of traversed regions without manual annotation. These automatically generated labels are used to train a deep segmentation network that predicts motion-conditioned path proposals from a single RGB image at run time, without explicit modeling of road or lane markings. Trained on diverse, unconstrained internet data, the model implicitly captures scene layout, lane topology, and intersection structure, and generalizes across varying camera configurations. We evaluate our approach on NuScenes, demonstrating reliable trajectory prediction, and further show transfer to an electric scooter platform through light fine-tuning. Our results indicate that large-scale ego-motion distillation yields structured and generalizable path proposals beyond the demonstrated trajectory, enabling trajectory hypothesis estimation via image segmentation.

225. 【2603.20554】When Negation Is a Geometry Problem in Vision-Language Models

Link: https://arxiv.org/abs/2603.20554

Authors: Fawaz Sammani, Tzoulio Chamiti, Paul Gavrikov, Nikos Deligiannis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Joint Vision-Language Embedding, plain blue shirt, Joint Vision-Language, failing to distinguish, text queries

Comments: Accepted to CVPR (Multimodal Algorithmic Reasoning Workshop) 2026

Abstract:Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish "no" in the query: "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.

226. 【2603.20530】Memory Over Maps: 3D Object Localization Without Reconstruction

Link: https://arxiv.org/abs/2603.20530

Authors: Rui Zhou, Xander Yap, Jianwen Cao, Allison Lau, Boyang Sun, Marc Pollefeys

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: prerequisite for embodied, scene, Target localization, localization, embodied tasks

Comments: 8 pages, 6 figures

Abstract: Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory, without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: this https URL
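Depth backprojection and multi-view fusion, the last step described in the abstract, can be sketched with a pinhole camera model. Everything below is an assumption made for illustration: the pinhole intrinsics (fx, fy, cx, cy), the camera-to-world pose given as a rotation matrix R and translation t, and fusion by simple averaging. The paper may use a different fusion rule.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam, R, t):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def fuse(points):
    """Fuse per-view world estimates by averaging (an assumed, simple rule)."""
    if not points:
        raise ValueError("need at least one point")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

Because only the pixels of the queried target are lifted, the cost scales with the number of retrieved keyframes rather than with the size of the scene.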

227. 【2603.20519】End-to-End Optimization of Polarimetric Measurement and Material Classifier

Link: https://arxiv.org/abs/2603.20519

Authors: Ryota Maeda, Naoki Arikawa, Yutaka No, Shinsaku Hiura

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scene understanding, fundamental problem, problem in computer, computer vision, vision and plays

Comments: Presented at VISAPP 2026 (21st International Conference on Computer Vision Theory and Applications)

Abstract:Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.

228. 【2603.20509】Lessons and Open Questions from a Unified Study of Camera-Trap Species Recognition Over Time

Link: https://arxiv.org/abs/2603.20509

Authors: Sooyoung Jeon, Hongjie Tian, Lemeng Wang, Zheda Mai, Vidhi Bakshi, Jiacheng Hou, Ping Zhang, Arpita Chowdhury, Jianyang Gu, Wei-Lun Chao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale biodiversity monitoring, diverse deployment environments, accurate automated analysis, biodiversity monitoring, vital for large-scale

Comments: The first three authors contribute equally

Abstract:Camera traps are vital for large-scale biodiversity monitoring, yet accurate automated analysis remains challenging due to diverse deployment environments. While the computer vision community has mostly framed this challenge as cross-domain generalization, this perspective overlooks a primary challenge faced by ecological practitioners: maintaining reliable recognition at the fixed site over time, where the dynamic nature of ecosystems introduces profound temporal shifts in both background and animal distributions. To bridge this gap, we present the first unified study of camera-trap species recognition over time. We introduce a realistic benchmark comprising 546 camera traps with a streaming protocol that evaluates models over chronologically ordered intervals. Our end-user-centric study yields four key findings. (1) Biological foundation models (e.g., BioCLIP 2) underperform at numerous sites even in initial intervals, underscoring the necessity of site-specific adaptation. (2) Adaptation is challenging under realistic evaluation: when models are updated using past data and evaluated on future intervals (mirrors real deployment lifecycles), naive adaptation can even degrade below zero-shot performance. (3) We identify two drivers of this difficulty: severe class imbalance and pronounced temporal shift in both species distribution and backgrounds between consecutive intervals. (4) We find that effective integration of model-update and post-processing techniques can largely improve accuracy, though a gap from the upper bounds remains. Finally, we highlight critical open questions, such as predicting when zero-shot models will succeed at a new site and determining whether/when model updates are necessary. Our benchmark and analysis provide actionable deployment guidelines for ecological practitioners while establishing new directions for future research in vision and machine learning.

229. 【2603.20475】CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models

Link: https://arxiv.org/abs/2603.20475

Authors: Kaizhen Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains poorly understood, relations remains poorly, Vision-language models, encode directional relations, directional relations remains

Abstract:Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
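The idea of projecting attributions into a reference-centered polar system and reading off a distribution over compass sectors can be sketched as follows. This is an illustrative reconstruction, not CREG itself: the eight-sector layout, the sector names, and the input format of (x, y, weight) triples are all assumptions.

```python
import math

# Assumed layout: eight 45-degree sectors, counterclockwise starting at east.
SECTORS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]


def compass_distribution(points, ref):
    """Bin positive attribution mass by direction from a reference point.

    points: iterable of (x, y, weight) in image coordinates (y grows downward).
    ref: (x, y) center of the reference object.
    Returns a dict mapping sector name to normalized mass.
    """
    mass = [0.0] * len(SECTORS)
    for x, y, w in points:
        dx, dy = x - ref[0], ref[1] - y  # flip y so that "north" means up
        if (dx == 0 and dy == 0) or w <= 0:
            continue  # no direction at the center; ignore non-positive mass
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        mass[int(round(angle / 45.0)) % len(SECTORS)] += w
    total = sum(mass)
    return {s: (m / total if total else 0.0) for s, m in zip(SECTORS, mass)}
```

For example, `compass_distribution([(10, 0, 1.0), (0, -10, 3.0)], (0, 0))` puts a quarter of the mass in the east sector and three quarters in the north sector.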

230. 【2603.20461】Inverting Neural Networks: New Methods to Generate Neural Network Inputs from Prescribed Outputs

Link: https://arxiv.org/abs/2603.20461

Authors: Rebecca Pattichis, Sebastian Janampa, Constantinos S. Pattichis, Marios S. Pattichis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems describe complex, Neural network systems, network systems describe, describe complex mappings, difficult to understand

Comments: Accepted at 2024 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)

Abstract:Neural network systems describe complex mappings that can be very difficult to understand. In this paper, we study the inverse problem of determining the input images that get mapped to specific neural network classes. Ultimately, we expect that these images contain recognizable features that are associated with their corresponding class classifications. We introduce two general methods for solving the inverse problem. In our forward pass method, we develop an inverse method based on a root-finding algorithm and the Jacobian with respect to the input image. In our backward pass method, we iteratively invert each layer, at the top. During the inversion process, we add random vectors sampled from the null-space of each linear layer. We demonstrate our new methods on both transformer architectures and sequential networks based on linear layers. Unlike previous methods, we show that our new methods are able to produce random-like input images that yield near perfect classification scores in all cases, revealing vulnerabilities in the underlying networks. Hence, we conclude that the proposed methods provide a more comprehensive coverage of the input image spaces that solve the inverse mapping problem.
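
Why null-space vectors matter for inversion can be seen in a toy example: adding any null-space vector of a linear layer to its input leaves the output unchanged, so many different inputs map to the same output. The sketch below uses a hypothetical 2x3 weight matrix and the cross product of its rows to obtain a null-space direction; it is an illustration of the linear-algebra fact only, not the authors' algorithm.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def null_vector_2x3(W):
    """Return a null-space vector of a 2x3 matrix via the cross product of its rows."""
    (a1, a2, a3), (b1, b2, b3) = W
    return [a2 * b3 - a3 * b2, a3 * b1 - a1 * b3, a1 * b2 - a2 * b1]


# Toy linear layer (hypothetical values): 3 inputs, 2 outputs.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
x = [2.0, -1.0, 0.5]           # one input that produces the target output
n = null_vector_2x3(W)         # a direction the layer cannot distinguish

# Any multiple of n can be added to x without changing the layer output.
x_alt = [xi + 3.0 * ni for xi, ni in zip(x, n)]
```

Here `matvec(W, x)` and `matvec(W, x_alt)` agree even though `x` and `x_alt` differ, which is the degree of freedom a layer-by-layer inversion can sample from.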

231. 【2603.20448】Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis

Link: https://arxiv.org/abs/2603.20448

Authors: M. Kerem Aydin, Vishwanath Saragadam, Emma Alexander

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: provide reliable visibility, imagery remains significantly, remains significantly harder, thermal imagery remains, cameras provide reliable

Comments: To be published at CVPR, 2026. 15 Pages, 29 Figures

Abstract:Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance (beyond camera pose) is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.

232. 【2603.20428】Benchmarking Efficient & Effective Camera Pose Estimation Strategies for Novel View Synthesis

Link: https://arxiv.org/abs/2603.20428

Authors: Jhacson Meza, Martin R. Oswald, Torsten Sattler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: NVS, SfM, scene representation, produce photo-realistic, Classical SfM

Abstract:Novel view synthesis (NVS) approaches such as NeRFs or 3DGS can produce photo-realistic 3D scene representation from a set of images with known extrinsic and intrinsic parameters. The necessary camera poses and calibrations are typically obtained from the images via Structure-from-Motion (SfM). Classical SfM approaches rely on local feature matches between the images to estimate both the poses and a sparse 3D model of the scene, using bundle adjustment to refine initial pose, intrinsics, and geometry estimates. In order to increase run-time efficiency, recent SfM systems forgo optimization via bundle adjustment. Instead, they train feed-forward (transformer-based) neural networks to directly regress camera parameters and the 3D structure. While orders of magnitude more efficient, such recent works produce significantly less accurate estimates. To stimulate research on developing SfM approaches that are both efficient and effective, this paper develops a benchmark focused on SfM for novel view synthesis. Using existing datasets and two simple strategies for making the reconstruction process more efficient, we show that: (1) simply using fewer features already significantly accelerates classical SfM methods while maintaining high pose accuracy. (2) using feed-forward networks to obtain initial estimates and refining them using classical SfM techniques leads to the best efficiency-effectiveness trade-off. We will make our benchmark and code publicly available.

233. 【2603.20422】PEARL: Personalized Streaming Video Understanding Model

Link: https://arxiv.org/abs/2603.20422

Authors: Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: memories over time, continuously recognize, identities and update, update our memories, Streaming Video Understanding

Comments: Arxiv Submission

Abstract:Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at this https URL.

234. 【2603.20403】FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

Link: https://arxiv.org/abs/2603.20403

Authors: Maxime Fontana, Michael Spratling, Miaojing Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting models pre-trained, Adapting models, reach strong performance, strong performance quickly, models pre-trained

Comments: CVPR 2026

Abstract:Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.

235. 【2603.20391】Monocular Models are Strong Learners for Multi-View Human Mesh Recovery

Link: https://arxiv.org/abs/2603.20391

Authors: Haoyu Xie, Shengkai Xu, Cheng Guo, Muhammad Usama Saleem, Wenhan Wu, Chen Chen, Ahmed Helmy, Pu Wang, Hongfei Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human mesh recovery, Multi-view human mesh, mesh recovery, generalization are essential, human mesh

Abstract:Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.

236. 【2603.20386】Jigsaw Regularization in Whole-Slide Image Classification

Link: https://arxiv.org/abs/2603.20386

Authors: So Won Jeong, Veronika Ročková

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Computational pathology involves, Computational pathology, pathology involves, involves the digitization, digitization of stained

Abstract:Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision foundation-model embeddings to incorporate local spatial structure within each patch, and (2) achieve across-patch spatial awareness using graph neural networks together with a novel jigsaw regularization. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.

237. 【2603.20383】Multi-Stage Fine-Tuning of Pathology Foundation Models with Head-Diverse Ensembling for White Blood Cell Classification

Link: https://arxiv.org/abs/2603.20383

Authors: Antony Gitau, Martin Paulson, Bjørn-Jostein Singstad, Karl Thomas Hjelmervik, Ola Marius Lysaker, Veralia Gabriela Sanchez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: white blood cells, peripheral blood smears, blood cells, white blood, peripheral blood

Comments: Accepted to ISBI 2026

Abstract:The classification of white blood cells (WBCs) from peripheral blood smears is critical for the diagnosis of leukemia. However, automated approaches still struggle due to challenges including class imbalance, domain shift, and morphological continuum confusion, where adjacent maturation stages exhibit subtle, overlapping features. We present a multi-stage fine-tuning methodology for 13-class WBC classification in the WBCBench 2026 Challenge (ISBI 2026). Our best-performing model is a fine-tuned DINOBloom-base, on which we train multiple classifier head families (linear, cosine, and multilayer perceptron (MLP)). The cosine head performed best on the mature granulocyte boundary (Band neutrophil (BNE) F1 = 0.470), the linear head on more immature granulocyte classes (Metamyelocyte (MMY) F1 = 0.585), and the MLP head on the most immature granulocyte (Promyelocyte (PMY) F1 = 0.733), revealing class-specific specialization. Based on this specialization, we construct a head-diverse ensemble, where the MLP head acts as the primary predictor, and its predictions within the four predefined confusion pairs are replaced only when two other head families agree. We further show that cases consistently misclassified by all models are substantially enriched for probable labeling errors or inherent morphological ambiguity.
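
The head-diverse ensembling rule can be sketched as below. The abstract does not list the four confusion pairs, so `CONFUSION_PAIRS` holds placeholder pairs built from class codes that the abstract happens to mention. The reading that the two agreeing heads must vote for the partner class of the MLP's prediction is also an assumption.

```python
# Placeholder confusion pairs for illustration only; the abstract says four
# pairs are predefined but does not name them.
CONFUSION_PAIRS = [("BNE", "MMY"), ("MMY", "PMY")]


def partners_of(label, pairs):
    """Return the set of labels that share a confusion pair with `label`."""
    partners = set()
    for a, b in pairs:
        if label == a:
            partners.add(b)
        elif label == b:
            partners.add(a)
    return partners


def ensemble_predict(mlp_pred, linear_pred, cosine_pred, pairs=CONFUSION_PAIRS):
    """Keep the MLP prediction unless both other heads agree on a paired class."""
    partners = partners_of(mlp_pred, pairs)
    if linear_pred == cosine_pred and linear_pred in partners:
        return linear_pred  # two head families agree inside a confusion pair
    return mlp_pred         # MLP head remains the primary predictor
```

Under these assumptions, `ensemble_predict("MMY", "BNE", "BNE")` switches to `"BNE"`, while a disagreement between the linear and cosine heads leaves the MLP prediction in place.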

238. 【2603.20382】Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier

Link: https://arxiv.org/abs/2603.20382

Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Bingjie Gao, Li Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: involve chaining multiple, chaining multiple generative, image generator, practical AI workflows, involve chaining

Comments: Accepted by ICME 2026

Abstract:In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.

239. 【2603.20353】Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation

Link: https://arxiv.org/abs/2603.20353

Authors: Preeti Meena, Himanshu Kumar, Sandeep Yadav

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Keywords: LiDAR scan, represented visually, saliency graph representation, Scene, saliency graph

Abstract:A scene can be represented visually in different formats, such as RGB-D, LiDAR scans, keypoints, and rectangular, spherical, or multi-view images. In these formats, the information relevant to applications such as scene indexing and vision-based navigation is only implicitly embedded, so the representations may not be efficient for such applications. This paper proposes a novel 360° saliency graph representation of scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular position in the 360° graph. Also, this representation is robust against scene view change and addresses challenges of indoor environments such as varied illumination, occlusions, and shadows as in the case of existing traditional methods. We have utilized this rich and efficient representation for vision-based navigation and compared it with existing navigation methods using 360° scenes. However, these existing methods suffer from limitations of poor scene representation, lacking scene-specific information. This work utilizes the proposed representation first to localize the query scene in the given topological map, and then to facilitate 2D navigation by estimating the next required movement directions towards the target destination in the topological map, using the geometric information embedded in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.

240. 【2603.20348】Toward a Multi-View Brain Network Foundation Model: Cross-View Consistency Learning Across Arbitrary Atlases

Link: https://arxiv.org/abs/2603.20348

Authors: Jiaxing Xu, Jingying Ma, Xin Lin, Yuxiao Liu, Kai He, Qika Lin, Yiping Ke, Yang Li, Dinggang Shen, Mengling Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: neurological disorder identification, characterizing brain organization, brain network foundation, Brain network analysis, Brain network

Abstract:Brain network analysis provides an interpretable framework for characterizing brain organization and has been widely used for neurological disorder identification. Recent advances in self-supervised learning have motivated the development of brain network foundation models. However, existing approaches are often limited by atlas dependency, insufficient exploitation of multiple network views, and weak incorporation of anatomical priors. In this work, we propose MV-BrainFM, a multi-view brain network foundation model designed to learn generalizable and scalable representations from brain networks constructed with arbitrary atlases. MV-BrainFM explicitly incorporates anatomical distance information into Transformer-based modeling to guide inter-regional interactions, and introduces an unsupervised cross-view consistency learning strategy to align representations from multiple atlases of the same subject in a shared latent space. By jointly enforcing within-view robustness and cross-view alignment during pretraining, the model effectively captures complementary information across heterogeneous network views while remaining atlas-aware. In addition, MV-BrainFM adopts a unified multi-view pretraining paradigm that enables simultaneous learning from multiple datasets and atlases, significantly improving computational efficiency compared to conventional sequential training strategies. The proposed framework also demonstrates strong scalability, consistently benefiting from increasing data diversity while maintaining stable performance across unseen atlas configurations. Extensive experiments on more than 20K subjects from 17 fMRI datasets show that MV-BrainFM consistently outperforms 14 existing brain network foundation models and task-specific baselines under both single-atlas and multi-atlas settings.

241. 【2603.20337】High-fidelity Multi-view Normal Integration with Scale-encoded Neural Surface Representation

Link: https://arxiv.org/abs/2603.20337

Authors: Tongyu Yang, Heng Guo, Yasuyuki Matsushita, Fumio Okura, Yu Luo, Xin Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Previous multi-view normal, methods typically sample, Previous multi-view, typically sample, sample a single

Comments: 12 pages, 11 figures

Abstract:Previous multi-view normal integration methods typically sample a single ray per pixel, without considering the spatial area covered by each pixel, which varies with camera intrinsics and the camera-to-object distance. Consequently, when the target object is captured at different distances, the normals at corresponding pixels may differ across views. This multi-view surface normal inconsistency results in the blurring of high-frequency details in the reconstructed surface. To address this issue, we propose a scale-encoded neural surface representation that incorporates the pixel coverage area into the neural representation. By associating each 3D point with a spatial scale and calculating its normal from a hybrid grid-based encoding, our method effectively represents multi-scale surface normals captured at varying distances. Furthermore, to enable scale-aware surface reconstruction, we introduce a mesh extraction module that assigns an optimal local scale to each vertex based on the training observations. Experimental results demonstrate that our approach consistently yields high-fidelity surface reconstruction from normals observed at varying distances, outperforming existing multi-view normal integration methods.
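
The dependence of pixel coverage on intrinsics and distance follows from the pinhole camera model. The sketch below assumes a fronto-parallel surface and a focal length expressed in pixels; the function name is illustrative and is not taken from the paper.

```python
def pixel_footprint(depth, focal_length_px):
    """Approximate side length of the surface patch seen by one pixel.

    Pinhole model, fronto-parallel surface: one pixel spans depth / f,
    in the same units as depth, with f expressed in pixels.
    """
    if focal_length_px <= 0:
        raise ValueError("focal_length_px must be positive")
    return depth / focal_length_px


near = pixel_footprint(2.0, 1000.0)   # e.g. 2 m away with f = 1000 px
far = pixel_footprint(4.0, 1000.0)    # same camera, twice the distance
```

Doubling the distance doubles the footprint, so the same pixel averages normals over a larger surface area, which is the cross-view inconsistency the abstract describes.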

242. 【2603.20327】Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

Link: https://arxiv.org/abs/2603.20327

Authors: Liu hung ming

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Embedding Predictive Architectures, Joint Embedding Predictive, Predictive Architectures, Joint Embedding, Embedding Predictive

Comments: 35 pages, 6 figures, 3 tables, 26 equations; independent research report; Stage 1 of a four-stage AIM--V-JEPA 2 integration roadmap; code available at [this https URL](https://github.com/cyrilliu1974/JEPA)

Abstract:Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations -- not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^{-4}; MI 0.036--0.117 bits, NMI 1.2--3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
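
The reported NMI percentages can be reproduced from the MI values with a short calculation. The sketch assumes an 8-entry codebook, inferred from the "3-bit maximum", and normalization by that maximum. The contingency-table helper is a standard mutual-information estimate and is not taken from the paper's code.

```python
import math


def mutual_information_bits(joint_counts):
    """Mutual information (in bits) of a contingency table given as a list of rows."""
    total = sum(sum(row) for row in joint_counts)
    row_sums = [sum(row) for row in joint_counts]
    col_sums = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count == 0:
                continue  # 0 * log(0) is treated as 0
            p_xy = count / total
            p_x = row_sums[i] / total
            p_y = col_sums[j] / total
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


CODEBOOK_SIZE = 8                      # assumed from the "3-bit maximum"
MAX_BITS = math.log2(CODEBOOK_SIZE)    # 3.0 bits


def nmi_percent(mi_bits, max_bits=MAX_BITS):
    """Express MI as a percentage of the codebook's maximum entropy."""
    return 100.0 * mi_bits / max_bits


# Reported MI range from the abstract, re-expressed as NMI percentages.
low, high = nmi_percent(0.036), nmi_percent(0.117)
```

Rounded to one decimal place, `low` and `high` give 1.2 and 3.9, matching the abstract. Under the same 8-entry assumption, an active ratio of 62.5% would correspond to 5 of 8 codes in use.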

243. 【2603.20326】Prompt-Free Lightweight SAM Adaptation for Histopathology Nuclei Segmentation with Strong Cross-Dataset Generalization

Link: https://arxiv.org/abs/2603.20326

Authors: Muhammad Hassan Maqsood, Yanming Zhu, Alfred Lam, Getamesay Dagnaw, Xuefei Yin, Alan Wee-Chung Liew

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantitative tissue analysis, cancer diagnosis, crucial for quantitative, quantitative tissue, tissue analysis

Abstract:Histopathology nuclei segmentation is crucial for quantitative tissue analysis and cancer diagnosis. Although existing segmentation methods have achieved strong performance, they are often computationally heavy and show limited generalization across datasets, which constrains their practical deployment. Recent SAM-based approaches have shown great potential in general and medical imaging, but typically rely on prompt guidance or complex decoders, making them less suitable for histopathology images with dense nuclei and heterogeneous appearances. We propose a prompt-free and lightweight SAM adaptation that leverages multi-level encoder features and residual decoding for accurate and efficient nuclei segmentation. The framework fine-tunes only LoRA modules within the frozen SAM encoder, requiring just 4.1M trainable parameters. Experiments on three benchmark datasets (TNBC, MoNuSeg, and PanNuke) demonstrate state-of-the-art performance and strong cross-dataset generalization, highlighting the effectiveness and practicality of the proposed framework for histopathology applications.

244. 【2603.20325】DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis

Link: https://arxiv.org/abs/2603.20325

Authors: Getamesay Dagnaw, Xuefei Yin, Muhammad Hassan Maqsood, Yanming Zhu, Alan Wee-Chung Liew

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning models, medical image analysis, internal decision processes, decision processes remain, processes remain difficult

Abstract:Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address this issue, we propose DCG-Net, an end-to-end interpretable framework that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.

245. 【2603.20323】NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation

Link: https://arxiv.org/abs/2603.20323

Authors: Quang Dang Huynh, Xuefei Yin, Andrew Busch, Hugo G. Espinosa, Alan Wee-Chung Liew, Matthew T.O. Worsey, Yanming Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex spatiotemporal dynamics, estimation remains challenged, spatiotemporal dynamics, pose estimation remains, remains challenged

Abstract:Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.

246. 【2603.20317】Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction

Link: https://arxiv.org/abs/2603.20317

Authors: Durgendra Narayan Singh

Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Keywords: launch costs fall, Space-based compute, plausible as launch, launch costs, costs fall

Abstract:Space-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability.
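
The stereo-reconstruction figure can be checked with simple arithmetic. The sketch below uses the approximate sizes quoted in the abstract; the function name is illustrative.

```python
def reduction_percent(raw_mb, reduced_mb):
    """Percentage of payload removed when raw_mb is reduced to reduced_mb."""
    if raw_mb <= 0:
        raise ValueError("raw_mb must be positive")
    return 100.0 * (1.0 - reduced_mb / raw_mb)


# Approximate sizes quoted in the abstract (~306 MB down to ~1.57 MB).
stereo = reduction_percent(306.0, 1.57)
```

Rounded to two decimals, `stereo` is 99.49, consistent with the reported reduction; the same sizes correspond to a compression ratio of roughly 195:1.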

247. 【2603.20314】VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs

Link: https://arxiv.org/abs/2603.20314

Authors: Govinda Kolli, Adinath Madhavrao Dukre, Behzad Bozorgtabar, Dwarikanath Mahapatra, Imran Razzak

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Medical Vision-Language Models, generating responses based, Medical Vision-Language, Vision-Language Models, Visual Grounding Score

Abstract:Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token's visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
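
The reweighting idea can be sketched in a few lines. The abstract does not give the score's formula, so the log-ratio score, the exponential weighting, and the `alpha` parameter below are illustrative assumptions, not the paper's definition of VGS.

```python
import math


def vgs_reweight(p_original, p_distorted, alpha=1.0, eps=1e-12):
    """Reweight next-token probabilities by an assumed visual grounding score.

    p_original / p_distorted: dicts mapping token -> probability, obtained with
    the original image and a distorted copy respectively.
    The score is log(p_original / p_distorted); this form and `alpha` are
    illustrative assumptions.
    """
    scores = {
        tok: math.log((p + eps) / (p_distorted.get(tok, 0.0) + eps))
        for tok, p in p_original.items()
    }
    # Positive score: probability dropped when the image was degraded (grounded).
    # Zero or negative score: probability held or rose (likely language prior).
    weighted = {
        tok: p * math.exp(alpha * scores[tok]) for tok, p in p_original.items()
    }
    total = sum(weighted.values())
    return {tok: w / total for tok, w in weighted.items()}


# Toy distributions (hypothetical tokens and values).
p_orig = {"effusion": 0.5, "normal": 0.5}
p_dist = {"effusion": 0.2, "normal": 0.8}
adjusted = vgs_reweight(p_orig, p_dist)
```

In this toy case the token whose probability falls under distortion ("effusion") is amplified, while the token whose probability rises ("normal") is suppressed. Two forward passes per step, one per image version, would account for an overhead of about 2×.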

248. 【2603.20310】GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems

Link: https://arxiv.org/abs/2603.20310

Authors: Xiaojian Lin, Yaomin Shen, Junyuan Ma, Yujie Sun, Chengqing Bu, Wenxin Zhang, Zongzheng Zhang, Hao Fei, Lei Jin, Hao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Monocular vertex-level human-scene, Monocular vertex-level, assistive monitoring, fundamental capability, capability for interactive

Comments: 15 pages, 9 figures, Accepted at ICME 2026

Abstract:Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at this https URL.

249. 【2603.20307】EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Link: https://arxiv.org/abs/2603.20307

Authors: Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He, Xiaoyan Wu, Haoran Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)

Keywords: portrait and speech, Audio-driven talking head, aims to create, create vivid, vivid and realistic

Abstract:Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.

250. 【2603.20305】The Global-Local loop: what is missing in bridging the gap between geospatial data from numerous communities?

Link: https://arxiv.org/abs/2603.20305

Authors: Clément Mallet, Ana-Maria Raimond

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Surface, indirectly the Earth, describing directly, satellites to citizens, face an unprecedented

Comments: Accepted at the 2026 ISPRS Congress

Abstract: We face an unprecedented amount of geospatial data, describing directly or indirectly the Earth's surface at multiple spatial, temporal, and semantic scales, and stemming from numerous contributors, from satellites to citizens. The main challenge in all the geospatial-related communities lies in suitably leveraging a combination of some of the sources for either a generic or a thematic application. Certain data fusion schemes are predominantly exploited: they correspond to popular tasks with mainstream data sources, e.g., free archives of Sentinel images coupled with OpenStreetMap data under an open and widespread deep-learning backbone for land-cover mapping purposes. Most of these approaches unfortunately operate under a "master-slave" paradigm, where one source is basically integrated to help process the "main" source, without mutual advantages (e.g., large-scale estimation of a given biophysical variable using in-situ observations) and under a specific community bias. We argue that numerous key data fusion configurations, and in particular the effort in symmetrizing the exploitation of multiple data sources, are insufficiently addressed while being highly beneficial for generic or thematic applications. Bridges and retroactions between scales, communities and their respective sources are lacking, neglecting the utmost potential of such a "global-local loop". In this paper, we propose to establish the most relevant interaction schemes through illustrative use cases. We subsequently discuss under-explored research directions that could take advantage of leveraging available data through multiple extents and communities.

251. 【2603.20304】Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges

Link: https://arxiv.org/abs/2603.20304

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: photorealistic image generation, Diffusion Implicit Models, Denoising Diffusion Implicit, enable photorealistic image, unprecedented scale

Abstract: As diffusion models (DMs) enable photorealistic image generation at unprecedented scale, watermarking techniques have become essential for provenance establishment and accountability. Existing methods face challenges: sampling-based approaches operate on frozen models but require costly $N$-step Denoising Diffusion Implicit Models (DDIM) inversion (typically $N=50$) for zero-bit-only detection; fine-tuning-based methods achieve fast multi-bit extraction but couple the watermark to a specific model checkpoint, requiring retraining for each architecture. We propose DiffMark, a plug-and-play watermarking method that offers three key advantages over existing approaches: single-pass multi-bit detection, per-image key flexibility, and cross-model transferability. Rather than encoding the watermark into the initial noise vector, DiffMark injects a persistent learned perturbation $\delta$ at every denoising step of a completely frozen DM. The watermark signal accumulates in the final denoised latent $z_0$ and is recovered in a single forward pass. The central challenge of backpropagating gradients through a frozen UNet without traversing the full denoising chain is addressed by employing Latent Consistency Models (LCM) as a differentiable training bridge. This reduces the number of gradient steps from 50 DDIM to 4 LCM and enables single-pass detection at 16.4 ms, a 45x speedup over sampling-based methods. Moreover, by this design, the encoder learns to map any runtime secret to a unique perturbation at inference time, providing genuine per-image key flexibility and transferability to unseen diffusion-based architectures without per-model fine-tuning. While achieving these advantages, DiffMark maintains competitive watermark robustness against distortion, regeneration, and adversarial attacks.

252. 【2603.20303】InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching

Link: https://arxiv.org/abs/2603.20303

Authors: Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: ordinary differential equation, high-fidelity visual generation, robust continuous-time alternative, differential equation, recently emerged

Abstract: Flow Matching (FM) has recently emerged as a leading approach for high-fidelity visual generation, offering a robust continuous-time alternative to ordinary differential equation (ODE) based models. However, despite their success, FM models are highly sensitive to dataset biases, which cause severe semantic degradation when generating out-of-distribution or minority-class samples. In this paper, we provide a rigorous mathematical formalization of the "Bias Manifold" within the FM framework. We identify that this performance drop is driven by conditional expectation smoothing, a mechanism that inevitably leads to trajectory lock-in during inference. To resolve this, we introduce InjectFlow, a novel, training-free method that injects orthogonal semantics during the initial velocity field computation, without requiring any changes to the random seeds. This design effectively prevents the latent drift toward majority modes while maintaining high generative quality. Extensive experiments demonstrate the effectiveness of our approach. Notably, on the GenEval dataset, InjectFlow successfully fixes 75% of the prompts that standard flow matching models fail to generate correctly. Ultimately, our theoretical analysis and algorithm provide a ready-to-use solution for building fairer and more robust visual foundation models.

253. 【2603.20292】HSI Image Enhancement Classification Based on Knowledge Distillation: A Study on Forgetting

Link: https://arxiv.org/abs/2603.20292

Authors: Songfeng Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: unavoidable challenge, tasks for hyperspectral, catastrophic forgetting, Abstract, samples

Comments: 18 pages, 7 figures

Abstract:In incremental classification tasks for hyperspectral images, catastrophic forgetting is an unavoidable challenge. While memory recall methods can mitigate this issue, they heavily rely on samples from old categories. This paper proposes a teacher-based knowledge retention method for incremental image classification. It alleviates model forgetting of old category samples by utilizing incremental category samples, without depending on old category samples. Additionally, this paper introduces a mask-based partial category knowledge distillation algorithm. By decoupling knowledge distillation, this approach filters out potentially misleading information that could misguide the student model, thereby enhancing overall accuracy. Comparative and ablation experiments demonstrate the proposed method's robust performance.

254. 【2603.20290】Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly

Link: https://arxiv.org/abs/2603.20290

Authors: Qihao Lin, Borui Chen, Yuping Zhou, Jianing Wu, Yulan Guo, Weishi Zheng, Chongkun Xia

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: cultural relic restoration, optical instrument repair, contour estimation, precision optical instrument, device broken accidents

Comments: 17 pages, 22 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract: The contour estimation of transparent fragments is very important for autonomous reassembly, especially in the fields of precision optical instrument repair, cultural relic restoration, and accident identification for other broken precious devices. Unlike general intact transparent objects, the contour estimation of transparent fragments faces greater challenges due to strict optical properties, irregular shapes and edges. To address this issue, a general transparent fragments contour estimation framework based on visual-tactile fusion is proposed in this paper. First, we construct the transparent fragment dataset named TransFrag27K, which includes multi-scene synthetic data of broken fragments from multiple types of transparent objects, and a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate and segment the sampling grasping position. We then use a two-finger gripper with GelSight Mini sensors to obtain reconstructed tactile information of the lateral edge of the fragments. By fusing this tactile information with visual cues, a visual-tactile fusion material classifier is proposed. Inspired by the way humans estimate a fragment's contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, a contour matching and reassembly algorithm based on multi-dimensional similarity metrics is proposed, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and codes are available at this https URL.

255. 【2603.20289】Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects

Link: https://arxiv.org/abs/2603.20289

Authors: Heng Zhou, Xiaoxiong Liu, Zhenxi Zhang, Jieheng Yun, Chengyang Li, Yunchu Yang, Dongyi Xia, Chunna Tian, Xiao-Jun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing images, hinder downstream applications, obscure surface reflectance, Remote sensing, sensing images

Comments: 82 pages, 23 figures

Abstract: Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSI dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression: from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12%~18% and reduce perceptual errors by 20%~35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges: dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations, and outline promising research directions for future development of trustworthy, controllable, and efficient (TCE) dehazing systems. All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations, are publicly available at this https URL.

256. 【2603.20288】Efficient Visual Anomaly Detection at the Edge: Enabling Real-Time Industrial Inspection on Resource-Constrained Devices

Link: https://arxiv.org/abs/2603.20288

Authors: Arianna Stropeni, Fabrizio Genilotti, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Anomaly Detection, automatic defect detection, Visual Anomaly, Anomaly Detection, defect detection

Abstract: Visual Anomaly Detection (VAD) is essential for industrial quality control, enabling automatic defect detection in manufacturing. In real production lines, VAD systems must satisfy strict real-time and privacy requirements, necessitating a shift from cloud-based processing to local edge deployment. However, processing data locally on edge devices introduces new challenges because edge hardware has limited memory and computational resources. To overcome these limitations, we propose two efficient VAD methods designed for edge deployment: PatchCore-Lite and PaDiM-Lite, based on the popular PatchCore and PaDiM models. PatchCore-Lite first runs a coarse search on a product-quantized memory bank, then an exact search on a decoded subset. PaDiM-Lite is sped up using diagonal covariance, turning Mahalanobis distance into efficient element-wise computation. We evaluate our methods on the MVTec AD and VisA benchmarks and show their suitability for edge environments. PatchCore-Lite achieves a remarkable 79% reduction in total memory footprint, while PaDiM-Lite achieves substantial efficiency gains with a 77% reduction in total memory and a 31% decrease in inference time. These results show that VAD can be effectively deployed on edge devices, enabling real-time, private, and cost-efficient industrial inspection.
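
The diagonal-covariance shortcut mentioned above follows from a standard identity: when the covariance matrix is diagonal, the Mahalanobis distance needs no matrix inverse and reduces to a per-dimension sum of squared deviations divided by variances. The sketch below shows that identity only. The function name and the `eps` stabilizer are assumptions for illustration, not the paper's implementation.

```python
import math


def mahalanobis_diag(x, mean, var, eps=1e-6):
    """Mahalanobis distance under a diagonal covariance.

    With covariance diag(var), d(x) = sqrt(sum_i (x_i - mean_i)^2 / var_i),
    so the cost is linear in the feature dimension instead of quadratic.
    eps guards against zero variance (an assumed safeguard).
    """
    return math.sqrt(
        sum((xi - mi) ** 2 / (vi + eps) for xi, mi, vi in zip(x, mean, var))
    )


# Unit variances recover the Euclidean distance.
print(mahalanobis_diag([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], eps=0.0))  # prints 5.0
```

Storing one variance per dimension instead of a full covariance matrix also shrinks the per-patch statistics from quadratic to linear in the feature dimension.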

257. 【2603.20284】STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2603.20284

Authors: Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)

Keywords: efficient memory usage, streaming inputs requires, inputs requires, Online, temporal consistency

Comments: 10 pages, 6 figures. Accepted by CVPR 2026

Abstract:Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
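
The first component above keeps tokens according to "decayed cumulative attention scores". One plausible reading, sketched below, is an exponentially decayed running sum of the attention each cached token receives, followed by top-k retention. The decay value, the function names, and the top-k eviction rule are all assumptions, since the abstract does not specify them.

```python
def update_scores(scores, attention, decay=0.9):
    """Decay the running score of each cached token and add its new attention mass."""
    return [decay * s + a for s, a in zip(scores, attention)]


def keep_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four cached tokens observed over three incoming frames (toy attention values).
scores = [0.0, 0.0, 0.0, 0.0]
for frame_attention in (
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.7, 0.1],
):
    scores = update_scores(scores, frame_attention)

print(keep_top_k(scores, 2))  # prints [0, 2]
```

Token 0 survives because it was strongly attended early on, even though its score has decayed, while token 2 leads because it is attended in the most recent frames.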

258. 【2603.20280】Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs

Link: https://arxiv.org/abs/2603.20280

Authors: Danial Monachan, Samira Nazari, Mahdi Taheri, Ali Azarpeyvand, Milos Krstic, Michael Huebner, Christian Herglotz

Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Keywords: Deploying deep neural, deep neural networks, edge devices requires, devices requires strong, requires strong compression

Abstract:Deploying deep neural networks (DNNs) on edge devices requires strong compression with minimal accuracy loss. This paper introduces Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that leverages sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations. The framework addresses a key limitation that different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. Mix-and-Match derives architecture-aware sparsity ranges, e.g., preserving normalization layers while pruning classifiers more aggressively, and systematically samples these ranges to produce ten strategies per sensitivity signal (magnitude, gradient, or their combination). This eliminates repeated pruning runs while offering deployment-ready accuracy-sparsity trade-offs. Experiments on CNNs and Vision Transformers demonstrate Pareto-optimal results, with Mix-and-Match reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning. These findings show that coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria.

259. 【2603.20275】Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection

Link: https://arxiv.org/abs/2603.20275

Authors: Saeed Khaki, Nima Safaei, Kamal Ginotra

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains poorly understood, require tight coupling, removing specific decoder, substantial depth redundancy, layers remains poorly

Abstract:Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs. This yields simple math-aware, non-math-aware, and mixed ranking criteria that identify layers whose input-output activations change least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three-regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing-aware strategies. Our domain-aware rankings achieve the strongest stability in the ranking-sensitive regime, while matching or exceeding structure-aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain-specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
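
The ranking idea above (prefer removing layers whose input and output activations differ least on a target domain) can be sketched as follows. Cosine similarity as the measure, the use of pre-averaged activation vectors, and the function names are assumptions for illustration; the abstract does not state the exact metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_layers_for_pruning(layer_io):
    """Order layer indices from most to least prunable.

    layer_io holds one (input_activation, output_activation) pair per layer,
    assumed to be averaged over inputs from the target domain (e.g. math).
    A layer whose output stays close to its input transforms the
    representation least, so it is ranked first for removal.
    """
    sims = [cosine(inp, out) for inp, out in layer_io]
    return sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)


# Toy activations for three layers.
layers = [
    ([1.0, 0.0], [0.0, 1.0]),  # layer 0 rotates the representation strongly
    ([1.0, 0.0], [1.0, 0.1]),  # layer 1 barely changes it
    ([1.0, 1.0], [1.0, 0.0]),  # layer 2 changes it moderately
]
print(rank_layers_for_pruning(layers))  # prints [1, 2, 0]
```

Running the same ranking on math and non-math activation sets separately would give the domain-specific orderings the abstract describes.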

260. 【2603.20273】Efficient AI-Driven Multi-Section Whole Slide Image Analysis for Biochemical Recurrence Prediction in Prostate Cancer

Link: https://arxiv.org/abs/2603.20273

Authors: Yesung Cho, Dongmyung Shin, Sujeong Hong, Jooyeon Lee, Seongmin Park, Geongyu Lee, Jongbae Park, Hong Koo Ha

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: frequently diagnosed malignancies, men worldwide, frequently diagnosed, diagnosed malignancies, malignancies in men

Abstract:Prostate cancer is one of the most frequently diagnosed malignancies in men worldwide. However, precise prediction of biochemical recurrence (BCR) after radical prostatectomy remains challenging due to the multifocality of tumors distributed throughout the prostate gland. In this paper, we propose a novel AI framework that simultaneously processes a series of multi-section pathology slides to capture the comprehensive tumor landscape across the entire prostate gland. To develop this predictive AI model, we curated a large-scale dataset of 23,451 slides from 789 patients. The proposed framework demonstrated strong predictive performance for 1- and 2-year BCR prediction, substantially outperforming established clinical benchmarks. The AI-derived risk score was validated as the most potent independent prognostic factor in a multivariable Cox proportional hazards analysis, surpassing conventional clinical markers such as pre-operative PSA and Gleason score. Furthermore, we demonstrated that integrating patch and slide sub-sampling strategies significantly reduces computational cost during both training and inference without compromising predictive performance, and generalizability of AI was confirmed through external validation. Collectively, these results highlight the clinical feasibility and prognostic value of the proposed AI-based multi-section slide analysis as a scalable tool for post-operative management in prostate cancer.

261. 【2603.20239】Rheos: Modelling Continuous Motion Dynamics in Hierarchical 3D Scene Graphs

Link: https://arxiv.org/abs/2603.20239

Authors: Iacopo Catalano, Francesco Verdoja, Javier Civera, Jorge Peña-Queralta, Julio A. Placed

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: tracking individual agents, dynamics remains limited, Scene Graphs, multi-resolution abstractions, individual agents

Abstract:3D Scene Graphs (3DSGs) provide hierarchical, multi-resolution abstractions that encode the geometric and semantic structure of an environment, yet their treatment of dynamics remains limited to tracking individual agents. Maps of Dynamics (MoDs) complement this by modeling aggregate motion patterns, but rely on uniform grid discretizations that lack semantic grounding and scale poorly. We present Rheos, a framework that explicitly embeds continuous directional motion models into an additional dynamics layer of a hierarchical 3DSG that enhances the navigational properties of the graph. Each dynamics node maintains a semi-wrapped Gaussian mixture model that captures multimodal directional flow as a principled probability distribution with explicit uncertainty, replacing the discrete histograms used in prior work. To enable online operation, Rheos employs reservoir sampling for bounded-memory observation buffers, parallel per-cell model updates and a principled Bayesian Information Criterion (BIC) sweep that selects the optimal number of mixture components, reducing per-update initialization cost from quadratic to linear in the number of samples. Evaluated across four spatial resolutions in a simulated pedestrian environment, Rheos consistently outperforms the discrete baseline under continuous as well as unfavorable discrete metrics. We release our implementation as open source.
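
The bounded-memory observation buffers mentioned above rely on reservoir sampling. A minimal sketch of the classic Algorithm R follows; the class name, the capacity, and the idea of storing heading angles are illustrative assumptions, not the released Rheos code.

```python
import random


class Reservoir:
    """Fixed-size uniform sample of a stream (classic Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of observations offered so far
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill phase: keep everything until the buffer is full.
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


# Hypothetical use: buffer motion directions (degrees) observed in one cell.
buffer = Reservoir(capacity=16, seed=0)
for heading in range(1000):
    buffer.add(heading % 360)
print(len(buffer.items), buffer.seen)  # prints "16 1000"
```

Every observation ends up in the buffer with equal probability, so a mixture model fitted to the buffer sees an unbiased sample of the whole stream while memory stays constant.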

262. 【2603.20201】FIGURA: A Modular Prompt Engineering Method for Artistic Figure Photography in Safety-Filtered Text-to-Image Models

Link: https://arxiv.org/abs/2603.20201

Authors: Luca Cazzaniga

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: models systematically block, treating classical nude, systematically block legitimate, classical nude photography, legitimate artistic content

Comments: 10 pages, 6 tables. Preprint

Abstract:Safety filters in commercial text-to-image (T2I) models systematically block legitimate artistic content involving the human figure, treating classical nude photography with the same restrictiveness as explicit material. While prior research has documented this problem extensively, no operational system exists that enables professional artists to generate artistic figure photography within the constraints of active safety filters. We present the FIGURA Method (Framework for Intelligent Generation of Unrestricted Artistic Results), a modular prompt engineering system comprising eight interconnected knowledge files, empirically validated through 200+ documented generation tests on FLUX 2 Pro (Cloud) with active safety filters at the default tolerance level. Our systematic testing reveals several previously undocumented findings: (1) safety filters primarily detect absence descriptions (references to missing clothing) rather than presence descriptions (references to body form), which we formalize as the Golden Rule; (2) artistic references to painters function simultaneously as aesthetic guides and as safety anchors that alter filter behavior; (3) spatial context operates as an independent filter variable, with documented success rate hierarchies; and (4) geometric vocabulary for body description bypasses pattern recognition in silhouette contexts. The system achieves documented success rates between 80% and 90% across five structured prompt templates, demonstrating that the artistic censorship problem identified in recent literature admits practical, systematic solutions that work with active safety mechanisms rather than circumventing them.

263. 【2603.20200】Your Robot Will Feel You Now: Empathy in Robots and Embodied Agents

Link: https://arxiv.org/abs/2603.20200

Authors: Angelica Lim, Ö. Nilay Yalçin

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: embodied conversational agents, human-robot interaction, embodied conversational, long studied, studied how empathy

Comments: Accepted manuscript. Chapter in "Empathy and Artificial Intelligence: Challenges, Advances and Ethical Considerations" edited by Anat Perry; C. Daryl Cameron

Abstract:The fields of human-robot interaction (HRI) and embodied conversational agents (ECAs) have long studied how empathy could be implemented in machines. One of the major drivers has been the goal of giving multimodal social and emotional intelligence to these artificially intelligent agents, which interact with people through facial expressions, body, gesture, and speech. What empathic behaviors and models have these fields implemented by mimicking human and animal behavior? In what ways have they explored creating machine-specific analogies? This chapter aims to review the knowledge from these studies, towards applying the lessons learned to today's ubiquitous, language-based agents such as ChatGPT.

264. 【2603.20198】Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning

Link: https://arxiv.org/abs/2603.20198

Authors: Yunbei Zhang, Yingqiang Ge, Weijie Xu, Yuhui Xu, Jihun Hamm, Chandan K. Reddy

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: red teaming treats, teaming treats images, multimodal red teaming, adversarial noise, red teaming

Abstract: Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner to synthesize comprehensive, multi-turn strategies, optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. To rigorously benchmark this reasoning-dependent threat, we introduce VE-Safety, a human-curated dataset filling a critical gap in evaluating high-risk technical visual understanding. MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2-5x where existing methods largely fail. These findings reveal that frontier models remain vulnerable to agentic multimodal attacks, exposing a critical gap in current safety alignment. Warning: This paper contains potentially harmful content.

265. 【2504.11289】UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Link: https://arxiv.org/abs/2504.11289

Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: report presents UniAnimate-DiT, human image animation, report presents, advanced project, project that leverages

Comments: The training and inference code (based on Wan2.1) is available at [this https URL](https://github.com/ali-vilab/UniAnimate-DiT)

Abstract:This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
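
As a rough illustration of the LoRA idea mentioned in this abstract, the sketch below shows the generic low-rank update W + (alpha / r) * B A in plain Python. It is not taken from the UniAnimate-DiT code base; the function names, the `alpha` scaling convention, and the toy matrices are assumptions made for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return w + (alpha / r) * (b @ a) for a frozen weight w.

    w: d_out x d_in frozen base weight
    b: d_out x r trainable matrix (commonly initialised to zeros)
    a: r x d_in trainable matrix
    alpha: scaling hyperparameter (hypothetical value chosen by the user)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in the two low-rank factors."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    a = [[1.0, 1.0]]          # rank r = 1
    b_zero = [[0.0], [0.0]]   # zero init leaves the base weight unchanged
    print(lora_effective_weight(w, a, b_zero, alpha=2.0))
    print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

With a zero-initialised `b`, the effective weight equals the frozen base weight, which is why fine-tuning starts from the original model's behaviour; only the two small factors are updated afterwards.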

266. 【2603.21891】HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation

Link: https://arxiv.org/abs/2603.21891

Authors: Amarnath R

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Retinal vessel segmentation, overlap losses tend, segmentation methods based, Retinal vessel, losses tend

Comments: 19 pages, 14 figures, 8 tables

Abstract:Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
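
To make the combined loss described above more concrete, here is a minimal sketch of a weighted Dice plus binary cross-entropy loss on flattened probability maps. It is an assumption-laden illustration, not the paper's implementation: the centerline Dice term is omitted because it needs a skeletonisation step, and the equal weights `w_dice` and `w_bce` are hypothetical.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), on flattened maps."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def combined_loss(pred, target, w_dice=0.5, w_bce=0.5):
    """Weighted sum of the two terms (weights are illustrative assumptions)."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)


if __name__ == "__main__":
    target = [0, 1, 1, 0]
    good = [0.05, 0.95, 0.9, 0.1]
    bad = [0.6, 0.4, 0.3, 0.7]
    print(combined_loss(good, target), combined_loss(bad, target))
```

The Dice term rewards overlap regardless of how few vessel pixels there are, while the cross-entropy term supplies a per-pixel gradient; a continuity term such as centerline Dice would be added as a third weighted summand.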

267. 【2603.21760】Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration

Link: https://arxiv.org/abs/2603.21760

Authors: Jiaqi Shang, Haojin Wu, Yinyi Lai, Zongyu Li, Chenghao Zhang, Jia Guo

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, image registration plays, Deformable image registration, medical image, image analysis

Abstract:Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.
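
The bidirectional consistency constraint mentioned above can be illustrated with a toy 1D example: applying the forward displacement and then the backward displacement should bring each point back to where it started. The sketch below is a generic inverse-consistency measure under simplifying assumptions (1D fields, linear interpolation, clamped borders); the abstract does not specify the paper's actual loss, and the function names are hypothetical.

```python
def interp(field, x):
    """Linearly interpolate a 1D field sampled at integer positions.

    Positions outside the grid are clamped to the nearest border.
    """
    x = min(max(x, 0.0), len(field) - 1.0)
    i = int(x)
    if i >= len(field) - 1:
        return field[-1]
    t = x - i
    return (1 - t) * field[i] + t * field[i + 1]


def inverse_consistency_error(forward, backward):
    """Mean |u(x) + v(x + u(x))| over the grid.

    forward:  displacement u sampled at integer grid points
    backward: displacement v sampled at integer grid points
    The value is zero when the backward field exactly undoes the forward one.
    """
    total = 0.0
    for x, u in enumerate(forward):
        total += abs(u + interp(backward, x + u))
    return total / len(forward)


if __name__ == "__main__":
    forward = [0.5] * 5
    print(inverse_consistency_error(forward, [-0.5] * 5))  # consistent pair
    print(inverse_consistency_error(forward, [0.5] * 5))   # inconsistent pair
```

In a learned registration model, a term of this kind would typically be added to the similarity and smoothness losses so that the forward and backward fields are estimated jointly rather than independently.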

268. 【2603.21510】Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability

Link: https://arxiv.org/abs/2603.21510

Authors: Jiahui Song, Sagar Shrestha, Xiao Fu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: roughly overlapping regions, covering roughly overlapping, unregistered hyperspectral image, hyperspectral image, multispectral image

Abstract:This paper addresses the fusion of a spatially unregistered pair consisting of a hyperspectral image (HSI) and a multispectral image (MSI) that cover roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving the HSI unchanged. Supervised deep learning approaches have been proposed for HSI super-resolution, but they rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolved MSI and HSI are established under reasonable generative models, providing, to the best of our knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.

269. 【2603.21235】Domain Elastic Transform: Bayesian Function Registration for High-Dimensional Scientific Data

Link: https://arxiv.org/abs/2603.21235

Authors: Osamu Hirose, Emanuele Rodola

Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: aligns continuous intensity, point set registration, aligns sparse geometries, continuous intensity fields, aligns continuous

Abstract:Nonrigid registration is conventionally divided into point set registration, which aligns sparse geometries, and image registration, which aligns continuous intensity fields on regular grids. However, this dichotomy creates a critical bottleneck for emerging scientific data, such as spatial transcriptomics, where high-dimensional vector-valued functions, e.g., gene expression, are defined on irregular, sparse manifolds. Consequently, researchers currently face a forced choice: either sacrifice single-cell resolution via voxelization to utilize image-based tools, or ignore the critical functional signal to utilize geometric tools. To resolve this dilemma, we propose Domain Elastic Transform (DET), a grid-free probabilistic framework that unifies geometric and functional alignment. By treating data as functions on irregular domains, DET registers high-dimensional signals directly without binning. We formulate the problem within a rigorous Bayesian framework, modeling domain deformation as an elastic motion guided by a joint spatial-functional likelihood. The method is fully unsupervised and scalable, utilizing feature-sensitive downsampling to handle massive atlases. We demonstrate that DET achieves 92% topological preservation on MERFISH data where state-of-the-art optimal transport methods struggle (<5%), and successfully registers whole-embryo Stereo-seq atlases across developmental stages, a task involving massive scale and complex nonrigid growth. The implementation of DET is available on this https URL (since Mar, 2025).

270. 【2603.20263】MiSiSUn: Minimum Simplex Semisupervised Unmixing

Link: https://arxiv.org/abs/2603.20263

Authors: Behnood Rasti, Bikram Koirala, Paul Scheunders

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: approach called minimum, called minimum simplex, geometric unmixing approach, unmixing approach called, minimum simplex semisupervised

Abstract:This paper proposes a semisupervised geometric unmixing approach called minimum simplex semisupervised unmixing (MiSiSUn). The geometry of the data is incorporated for the first time into library-based unmixing using a simplex-volume-flavored penalty based on an archetypal analysis-type linear model. Experiments were performed on two simulated datasets considering different levels of mixing ratios and spatial instruction at varying input noise. MiSiSUn considerably outperforms state-of-the-art semisupervised unmixing methods. The improvements vary from 1 dB to over 3 dB in different scenarios. The proposed method was also applied to a real dataset, where the visual interpretation is close to the geological map. MiSiSUn was implemented using PyTorch; the implementation is open-source and available at this https URL. Moreover, we provide a dedicated Python package for Semisupervised Unmixing, which is open-source and includes all the methods used in the experiments for the sake of reproducibility.