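This post lists the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

631 papers were updated today, including:

Natural language processing: 116 papers
Information retrieval: 12 papers
Computer vision: 113 papers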

Natural Language Processing

1. 【2603.05500】POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Link: https://arxiv.org/abs/2603.05500

Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, machine learning systems, modern machine learning, Efficient and stable, Reparameterized Orthogonal Equivalence

Comments: Technical report v1 (14 pages, 7 figures, project page: [this https URL](https://spherelab.ai/poetx/))

Abstract:Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.

2. 【2603.05498】The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Link: https://arxiv.org/abs/2603.05498

Authors: Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: exhibit extreme outliers, tokens exhibit extreme, tokens attract disproportionate, Transformer language models, attract disproportionate attention

Abstract:We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.

3. 【2603.05494】Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Link: https://arxiv.org/abs/2603.05494

Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, Large, lie detection, lie

Abstract:Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

4. 【2603.05488】Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Link: https://arxiv.org/abs/2603.05488

Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: continues generating tokens, model final answer, provide evidence, strongly confident, continues generating

Abstract:We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

5. 【2603.05471】Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Link: https://arxiv.org/abs/2603.05471

Authors: Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, built on Large, Large Language, core research challenge, natural language claims

Comments: Preprint

Abstract:Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models' intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

6. 【2603.05462】NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

Link: https://arxiv.org/abs/2603.05462

Authors: Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim

Categories: Computation and Language (cs.CL)

Keywords: Reading comprehension systems, languages face significant, face significant challenges, Reading comprehension, Bangladesh National Curriculum

Comments: 18 pages, 7 figures, 6 tables. Dataset contains 87,805 Bangla QA pairs from NCTB textbooks

Abstract:Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves a 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
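
The 313% figure for BERT follows directly from the two F1 scores quoted in the abstract. As a quick check, here is a minimal sketch of that arithmetic; the function name `relative_improvement` is purely illustrative and does not come from the paper.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a percentage."""
    if before == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (after - before) / before * 100.0


# F1 scores quoted in the abstract for BERT before and after fine-tuning.
bert_gain = relative_improvement(0.150, 0.620)
print(f"{bert_gain:.0f}%")  # rounds to the 313% reported in the abstract
```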

7. 【2603.05459】DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

Link: https://arxiv.org/abs/2603.05459

Authors: Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: simple everyday discussions, work activities, daily lives, simple everyday, social networks

Abstract:The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus carries a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.

8. 【2603.05451】FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Link: https://arxiv.org/abs/2603.05451

Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao

Categories: Computation and Language (cs.CL)

Keywords: ubiquitous Transformer architecture, large language models, ubiquitous Transformer, Transformer architecture, long-context applications

Abstract:Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
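
To make the idea of softmax rescaling more concrete, here is a minimal pure-Python sketch of the textbook online-softmax accumulation used by blockwise attention. It is not the FlashAttention-4 kernel: the condition shown (rescale only when a block raises the running maximum) is a simple stand-in for the "conditional softmax rescaling" the abstract mentions, and the function name, the scalar values, and the block size are all assumptions made for illustration.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)[i] * values[i] in one pass over blocks.

    `scores` must be non-empty. Scalar values keep the sketch short; real
    attention accumulates vectors instead.
    """
    running_max = -math.inf  # reference maximum used inside the exponentials
    denom = 0.0              # running softmax denominator
    acc = 0.0                # running weighted sum of values
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        block_max = max(block_scores)
        if block_max > running_max:
            # Rescale earlier partial results only when the maximum grows.
            # On the first block exp(-inf) is 0.0, which leaves the zeros as-is.
            scale = math.exp(running_max - block_max)
            denom *= scale
            acc *= scale
            running_max = block_max
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - running_max)
            denom += w
            acc += w * v
    return acc / denom
```

Because softmax is invariant to the constant subtracted inside the exponentials, skipping the rescale when the maximum is unchanged does not alter the result; it only avoids non-matmul work.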

9. 【2603.05450】Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Link: https://arxiv.org/abs/2603.05450

Authors: Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: mutually recognized facts, Establishing common ground, Partial Information Puzzle, Distributed Partial Information, multiparty settings

Comments: 10 pages, 4 figures

Abstract:Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

10. 【2603.05432】Ensembling Language Models with Sequential Monte Carlo

Link: https://arxiv.org/abs/2603.05432

Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Practitioners have access, prior work shows, language modeling tasks, highly sensitive, Practitioners

Abstract:Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
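
The naive baseline that the abstract criticizes, aggregating next-token probabilities and renormalizing at each step, is easy to sketch. The example below assumes a shared toy vocabulary (the paper instead works at the byte level so that vocabularies can differ), and every name in it is hypothetical. It shows only the local aggregation step for two choices of $f$, not the proposed SMC sampler.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def local_ensemble(dists, f):
    """Apply f to the K models' next-token probabilities, then renormalize locally."""
    vocab = dists[0].keys()
    return normalize({token: f([d[token] for d in dists]) for token in vocab})


def arithmetic_mean(ps):
    return sum(ps) / len(ps)


def geometric_mean(ps):
    return math.prod(ps) ** (1.0 / len(ps))


# Toy next-token distributions from two hypothetical models.
p1 = {"a": 0.9, "b": 0.1}
p2 = {"a": 0.5, "b": 0.5}

mixture = local_ensemble([p1, p2], arithmetic_mean)  # probability averaging
product = local_ensemble([p1, p2], geometric_mean)   # product-of-experts style
```

Sampling token by token from such locally normalized distributions is, per the abstract, only a biased approximation of the ensemble distribution over whole strings, which is the gap the SMC algorithm is meant to close.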

11. 【2603.05414】Dissociating Direct Access from Inference in AI Introspection

Link: https://arxiv.org/abs/2603.05414

Authors: Harvey Lederman, Kyle Mahowald

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: foundational cognitive ability, cognitive ability, foundational cognitive, Abstract, mechanism

Abstract:Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating Lindsey et al. (2025)'s thought injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for them correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.

12. 【2603.05400】An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs

Link: https://arxiv.org/abs/2603.05400

Authors: Deshan Sumanathilaka, Nicholas Micallef, Julian Hough

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Word Sense Disambiguation, Language Processing, Natural Language, challenge in Natural

Comments: Accepted at LREC 2026, 15 pages, 11 Tables

Abstract:Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.

13. 【2603.05369】Progressive Residual Warmup for Language Model Pretraining

Link: https://arxiv.org/abs/2603.05369

Authors: Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang

Categories: Computation and Language (cs.CL)

Keywords: Transformer architectures serve, modern Large Language, Large Language Models, Transformer architectures, Large Language

Abstract:Transformer architectures serve as the backbone for most modern Large Language Models; therefore, their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at this https URL.
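
The "early layer learns first" schedule can be pictured with a small sketch. The abstract only says that the scalar warms up from 0 to 1 and that deeper layers take longer, so the linear shape, the depth-proportional warmup length, the `base_warmup_steps` parameter, and the choice to scale the branch output are all assumptions made here for illustration.

```python
def residual_scale(step: int, layer_index: int, base_warmup_steps: int = 1000) -> float:
    """Scalar in [0, 1] applied to one layer's residual branch at a training step.

    Assumed schedule (not taken from the paper): linear warmup whose length
    grows linearly with depth, so layer 0 finishes first and deeper layers later.
    """
    warmup_steps = base_warmup_steps * (layer_index + 1)
    return min(1.0, step / warmup_steps)


def block_forward(x: float, branch_output: float, step: int, layer_index: int) -> float:
    """Toy scalar residual block: x + alpha * f(x), with alpha from the schedule."""
    return x + residual_scale(step, layer_index) * branch_output
```

Under these assumptions, at step 1000 the first layer already contributes fully (`residual_scale(1000, 0)` is 1.0) while the fourth layer is a quarter of the way through its warmup (`residual_scale(1000, 3)` is 0.25).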

14. 【2603.05357】DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

Link: https://arxiv.org/abs/2603.05357

Authors: Mohammad Mahdi Moradi, Sudhir Mudur

Categories: Computation and Language (cs.CL)

Keywords: heterogeneous reasoning problems, large language models, improving reasoning performance, additional supervision, leading to inefficient

Abstract:Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
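
The routing step described above, agreement among sampled trajectories deciding between supervised fine-tuning and reinforcement learning, can be sketched in a few lines. The agreement measure (share of the majority answer), the 0.7 threshold, and the function names are hypothetical choices for illustration; the abstract does not specify them.

```python
from collections import Counter


def consensus(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def route(answers: list[str], threshold: float = 0.7):
    """Pick an adaptation strategy for one input from its sampled final answers.

    High agreement: use the majority answer as a pseudo-label for SFT.
    Low agreement: hand the input to the RL branch, with no pseudo-label.
    """
    majority, agreement = consensus(answers)
    if agreement >= threshold:
        return ("sft", majority)
    return ("rl", None)
```

For example, `route(["42", "42", "42", "41"])` has 75% agreement and returns `("sft", "42")`, while `route(["1", "2", "3", "1"])` has only 50% agreement and returns `("rl", None)`.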

15. 【2603.05354】Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

Link: https://arxiv.org/abs/2603.05354

Authors: Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: multiple specialised models, scalable alternative, alternative to multi-task, multi-task training, training that combines

Comments: Submitted for review to the INTERSPEECH 2026 conference

Abstract:Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.

16. 【2603.05345】A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes

Link: https://arxiv.org/abs/2603.05345

Authors: Stefan Bott, Verena Riegler, Horacio Saggion, Almudena Rascón Alcaina, Nouran Khallaf

Categories: Computation and Language (cs.CL)

Keywords: life and society, understand information, key factor, self-determined life, corpus

Comments: Will be published in LREC26

Abstract:Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high quality simplification produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularly valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.

17. 【2603.05314】PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Link: https://arxiv.org/abs/2603.05314

Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: automatic speech recognition, Persian punctuation restoration, Punctuation restoration, speech recognition, essential for improving

Abstract:Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (this https URL) and model (this https URL) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
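
Framing punctuation restoration as token-level sequence labeling means turning punctuated text into (token, label) pairs, where each label names the mark that follows the token. The sketch below shows one plausible way to build such pairs. The label set, the whitespace tokenization, and the function name are assumptions for illustration, not the PersianPunc preprocessing; the English sentence is used only to keep the example easy to read.

```python
# Marks treated as labels: ASCII punctuation plus the Persian comma,
# semicolon, and question mark. This set is an assumption.
PUNCT_CHARS = ".,?!:;،؛؟"


def to_labeled_tokens(text: str) -> list[tuple[str, str]]:
    """Strip trailing punctuation from each whitespace-separated token.

    Each token is labeled with the first mark that followed it, or "O"
    when no punctuation followed.
    """
    pairs = []
    for raw in text.split():
        word = raw.rstrip(PUNCT_CHARS)
        trailing = raw[len(word):]
        label = trailing[0] if trailing else "O"
        if word:  # skip tokens that consisted only of punctuation
            pairs.append((word, label))
    return pairs


print(to_labeled_tokens("Hello, how are you?"))
# [('Hello', ','), ('how', 'O'), ('are', 'O'), ('you', '?')]
```

A model trained on such pairs sees only the bare tokens as input and predicts one label per token, so it can insert punctuation without rewriting the words themselves.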

18. 【2603.05308】Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Link: https://arxiv.org/abs/2603.05308

Authors: Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: article supports, supports an assertion, assertion is essential, Assessing, language models

Abstract:Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5 while also providing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at this https URL.

19. 【2603.05299】WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

Link: https://arxiv.org/abs/2603.05299

Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Large language models, remains challenging due, Large language, speech remains challenging, language models show

Comments: 6 pages, 1 figure

Abstract:Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at this https URL.

20. 【2603.05293】Knowledge Divergence and the Value of Debate for Scalable Oversight

Link: https://arxiv.org/abs/2603.05293

Authors: Robin Young

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: formal framework relates, advanced AI systems, debate, reinforcement learning, framework relates

备注：

点击查看摘要

Abstract: AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate's value through the geometry of knowledge divergence between debating models. Using principal angles between models' representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to an RLAIF-like setting in which a single-agent method recovers the same optimum. When models possess divergent knowledge, the debate advantage scales with a phase transition from a quadratic regime (debate offers negligible benefit) to a linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide existence results showing that debate can achieve outcomes inaccessible to either model alone, alongside a negative result showing that sufficiently strong adversarial incentives cause coordination failure in the compositional regime, with a sharp threshold separating effective from ineffective debate. We offer the first formal connection between debate and RLAIF, a geometric foundation for understanding when adversarial oversight protocols are justified, and a connection to the problem of eliciting latent knowledge across models with complementary information.

21. 【2603.05275】SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

Link: https://arxiv.org/abs/2603.05275

Authors: Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak, Matt Coler

Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: requires resolving pragmatic, resolving pragmatic incongruity, detection requires resolving, sarcasm detection requires, incongruity across textual

Abstract: Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner raises F1 to 70.22%, up from 59.83% (zero-shot) and 68.23% (supervised finetuning). These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.

22. 【2603.05272】Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh

Link: https://arxiv.org/abs/2603.05272

Authors: Mohammad Mamun Or Rashid

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Multilingual Cloud Corpus, Bangladesh ethnic, Multilingual Cloud, ethnic and indigenous, multimodal linguistic dataset

Abstract:We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (this http URL), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.

23. 【2603.05262】VietJobs: A Vietnamese Job Advertisement Dataset

Link: https://arxiv.org/abs/2603.05262

Authors: Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj

Categories: Computation and Language (cs.CL)

Keywords: million words collected, Vietnamese job advertisements, municipalities across Vietnam, publicly available corpus, million words

Comments: 10 pages

Abstract:VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: this https URL.

24. 【2603.05210】Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

Link: https://arxiv.org/abs/2603.05210

Authors: Ofir Ben Shoham

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, inference for Large, Large Language, decoding accelerates inference, Speculative decoding accelerates

Abstract:Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
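The coverage side of this trade-off can be illustrated with a small sketch. It is not the paper's method: the abstract describes a constrained optimization solved with a Tree-structured Parzen Estimator, whereas the code below uses a plain frequency-ranked greedy cut-off. The function name `trim_vocabulary`, the `min_coverage` parameter, and the toy `responses` data are all assumptions made for illustration.

```python
from collections import Counter


def trim_vocabulary(token_sequences, min_coverage):
    """Greedy sketch: keep the most frequent tokens until the kept set
    accounts for at least `min_coverage` of all token occurrences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0

    kept, covered = [], 0
    # Descending frequency; ties broken by token for a deterministic result.
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total >= min_coverage:
            break
        kept.append(tok)
        covered += count
    return kept, covered / total


# Hypothetical tokenized assistant responses (toy data).
responses = [["a", "b", "a", "c"], ["a", "b", "d", "a"]]
vocab, coverage = trim_vocabulary(responses, min_coverage=0.75)
print(vocab, coverage)
```

A smaller kept vocabulary shrinks the draft model's language modeling head, which is where the abstract locates the latency cost; the greedy cut-off here only shows how coverage is measured, not how the two objectives are balanced.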

25. 【2603.05207】Core-based Hierarchies for Efficient GraphRAG

Link: https://arxiv.org/abs/2603.05207

Authors: Jakir Hossain, Ahmet Erdem Sarıyüce

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: enhances large language, incorporating external knowledge, large language models, enhances large, large language

Abstract:Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be recursively summarized. Current GraphRAG approaches rely on Leiden clustering for community detection, but we prove that on sparse knowledge graphs, where average degree is constant and most nodes have low degree, modularity optimization admits exponentially many near-optimal partitions, making Leiden-based communities inherently non-reproducible. To address this, we propose replacing Leiden with k-core decomposition, which yields a deterministic, density-aware hierarchy in linear time. We introduce a set of lightweight heuristics that leverage the k-core hierarchy to construct size-bounded, connectivity-preserving communities for retrieval and summarization, along with a token-budget-aware sampling strategy that reduces LLM costs. We evaluate our methods on real-world datasets including financial earnings transcripts, news articles, and podcasts, using three LLMs for answer generation and five independent LLM judges for head-to-head evaluation. Across datasets and models, our approach consistently improves answer comprehensiveness and diversity while reducing token usage, demonstrating that k-core-based GraphRAG is an effective and efficient framework for global sensemaking.
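For readers unfamiliar with k-core decomposition, the following is a minimal sketch of the standard peeling procedure that assigns each node its core number. It is a simple quadratic-time version written for clarity, not the linear-time bucket-based algorithm the abstract refers to, and it does not reproduce the paper's community-construction heuristics. The function name `core_numbers` and the toy graph are assumptions.

```python
def core_numbers(adj):
    """Core number of every node in an undirected graph.

    `adj` maps each node to the set of its neighbours. Nodes are peeled in
    order of minimum remaining degree; the running maximum of those degrees
    is the core number of the node being removed.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Deterministic choice: lowest remaining degree, then node order.
        v = min(remaining, key=lambda u: (degree[u], u))
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(core_numbers(graph))
```

Because the peeling order is fixed by degree and a deterministic tie-break, repeated runs on the same graph give the same core numbers, which is the reproducibility property the abstract contrasts with modularity-based clustering.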

26. 【2603.05198】Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic

Link: https://arxiv.org/abs/2603.05198

Authors: Sara Candussio, Gabriele Sarti, Gaia Saveri, Luca Bortolussi

Categories: Computation and Language (cs.CL); Symbolic Computation (cs.SC)

Keywords: learning continuous neural, latent space, formal specifications, specifications by distilling, distilling the geometry

Abstract:We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels -- which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible -- or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel's logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.

27. 【2603.05197】Diffusion LLMs can think EoS-by-EoS

Link: https://arxiv.org/abs/2603.05197

Authors: Sarah Breckner, Sebastian Schuster

Categories: Computation and Language (cs.CL)

Keywords: complex reasoning tasks, EoS tokens, interdependent sub-goals, alternative to autoregressive, diffusion models

Abstract:Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.

28. 【2603.05193】Transducing Language Models

Link: https://arxiv.org/abs/2603.05193

Authors: Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira

Categories: Computation and Language (cs.CL)

Keywords: Modern language models, Modern language, language models, downstream tasks, tasks often require

Abstract:Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
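The induced distribution mentioned in the abstract can be written as $q(y) = \sum_{x : f(x) = y} p(x)$. The sketch below computes it by brute-force enumeration over a tiny finite distribution, purely to make the marginalization concrete. It is not the paper's algorithm, which works with finite-state transducers over the unbounded string space of a language model; the function name `pushforward` and the toy probabilities are assumptions.

```python
from collections import defaultdict


def pushforward(p, f):
    """Induced distribution of f(X) for X ~ p, where p is a finite dict
    mapping source strings to probabilities."""
    q = defaultdict(float)
    for x, prob in p.items():
        # Every source string mapping to the same target adds its mass.
        q[f(x)] += prob
    return dict(q)


# Hypothetical distribution over token sequences (toy numbers).
p = {
    ("un", "do"): 0.25,
    ("u", "ndo"): 0.25,
    ("re", "do"): 0.5,
}

# Deterministic string-to-string map: concatenate tokens into one string.
q = pushforward(p, lambda tokens: "".join(tokens))
print(q)
```

Two different tokenizations of the same surface string contribute to a single target probability, which is the marginalization over source strings the abstract describes.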

29. 【2603.05171】Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

Link: https://arxiv.org/abs/2603.05171

Authors: Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: operational annotation framework, proposes a systematic, systematic and operational, judicial decisions, judicial reasoning

Comments: The PDF contains both an English translation and the original Chinese guideline. The first 30 pages present the full English translation, while the remaining 25 pages provide the original Chinese version

Abstract:This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

30. 【2603.05168】Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

Link: https://arxiv.org/abs/2603.05168

Authors: Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei

Categories: Computation and Language (cs.CL)

Keywords: large language models, studied in isolation, approaches for improving, improving the efficiency, efficiency of large

Abstract:Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at this https URL
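As a rough illustration of the two ingredients, the sketch below applies a 2:4 magnitude mask (at most 2 nonzeros in every group of 4 consecutive weights) and an absmean-style ternary quantizer to a flat weight list. How Sparse-BitNet actually orders or couples these steps during training is not stated in the abstract; masking on the latent full-precision weights and then quantizing is an assumption, as are the function names and the toy weights.

```python
def nm_mask(weights, n=2, m=4):
    """Binary mask keeping the n largest-magnitude entries in each
    consecutive group of m weights (ties go to the lower index)."""
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        order = sorted(range(len(group)), key=lambda i: (-abs(group[i]), i))
        keep = set(order[:n])
        mask.extend(1 if i in keep else 0 for i in range(len(group)))
    return mask


def ternary_quantize(weights, eps=1e-8):
    """Absmean-style ternarization: scale by the mean absolute value,
    round to the nearest integer, and clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


# Hypothetical latent weights for one row of a weight matrix.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]

mask = nm_mask(weights, n=2, m=4)              # assumption: mask latent weights
ternary, scale = ternary_quantize(weights)
sparse_ternary = [t * k for t, k in zip(ternary, mask)]
print(mask, ternary, sparse_ternary)
```

In this toy row the ternary weights that the mask removes are mostly already zero or small in magnitude, which gives some intuition for the compatibility the abstract reports, though the sketch is not evidence for it.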

31. 【2603.05167】C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2603.05167

Authors: Avni Mittal, Rauno Arike

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, reliably assess process, assess process faithfulness, Large language, answer plausibility

Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.

32. 【2603.05143】Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers

Link: https://arxiv.org/abs/2603.05143

Authors: Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, multiple reasoning types, conflate multiple reasoning, Understanding reasoning, large language

Abstract:Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.

33. 【2603.05136】Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions

Link: https://arxiv.org/abs/2603.05136

Authors: Theresa Elstner, Martin Potthast

Categories: Computation and Language (cs.CL)

Keywords: validating algorithmic decisions, Representation Fidelity, paper introduces, dimension for validating, validating algorithmic

Abstract:This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures if decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations, how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 consists of a large corpus of 30 000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.

34. 【2603.05134】LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

Link: https://arxiv.org/abs/2603.05134

Authors: Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan, Bo An, Peng Jiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: online advertising platforms, making manual bidding, manual bidding impractical, intensified competition, making manual

Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively because of black-box training and the limited mode coverage of datasets, leading to challenges in understanding task status and in generalizing to dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, language and numerical inputs, for language-guided training of LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating LBM-Think's hallucinations and enhancing decision-making performance without the simulation or real-world rollouts required by previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in training efficiency and generalization ability.

35. 【2603.05121】Measuring the Redundancy of Decoder Layers in SpeechLLMs

Link: https://arxiv.org/abs/2603.05121

Authors: Adel Moumen, Guangzhi Sun, Philip C Woodland

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Speech Large Language, Large Language, Speech Large, total parameters

Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation and show that the same blocks of layers are redundant across speech encoders, tasks, and languages, indicating that a more global redundancy structure exists and enabling a single pruned, multi-task SpeechLLM backbone to be deployed.

36. 【2603.05099】ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI

Link: https://arxiv.org/abs/2603.05099

Authors: Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, Sahar Vahdati

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: probes few-shot abstraction, hand-authored puzzles due, few-shot abstraction, small visual grids, Reasoning Corpus

Abstract:The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.

37. 【2603.05092】Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series

Link: https://arxiv.org/abs/2603.05092

Authors: Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang, Yuhui Liu, Zhongyi Pei, Jianmin Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Time series, informed decision-making, witnessed an increasing, increasing demand, accurate predictions

Abstract:Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three distinct types of exogenous factors that influence temporal dynamics through distinct interaction modes. Based on this empirical insight, we propose Aura, a universal framework that explicitly organizes and encodes heterogeneous external information according to its interaction mode with the target time series. Specifically, Aura utilizes a tailored tripartite encoding mechanism to embed heterogeneous features into well-established time series models, ensuring seamless integration of non-sequential context. Extensive experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, demonstrate that Aura consistently achieves state-of-the-art performance across all baselines and exhibits superior adaptability. Our findings highlight Aura's potential as a general-purpose enhancement for aviation safety and reliability.

38. 【2603.05057】MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

链接：https://arxiv.org/abs/2603.05057

作者：Inayat Arshad,Fajar Saleem,Ijaz Hussain

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Urdu toxic span, existing systems rely, toxic span detection, Urdu toxic, detection remains limited

备注： 29 pages, 7 figures, 13 tables

点击查看摘要

Abstract:Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. The problem is further exacerbated by several factors: the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. In this research, we propose MUTEX, an Urdu toxic span detection framework that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and at handling code-switching and morphological variation.
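The token-level F1 metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes binary per-token tags (1 = toxic, 0 = non-toxic) and a single corpus-level score; the function name `token_f1` is hypothetical, and the paper's exact averaging scheme is not stated in the abstract.

```python
def token_f1(gold, pred):
    """Token-level F1 over binary tags (1 = toxic, 0 = non-toxic).

    Illustrative only: the abstract does not say whether F1 is computed
    per post or over the whole corpus; this version scores one flat sequence.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp + fp + fn == 0:
        # Assumption: no toxic tokens in either sequence counts as a perfect match.
        return 1.0
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example: two of three toxic tokens are found, with no false positives.
score = token_f1([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```

Here precision is 1.0 and recall is 2/3, so the score rewards exact token overlap rather than a sentence-level label.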

39. 【2603.05046】NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension

链接：https://arxiv.org/abs/2603.05046

作者：Rongzhi Li,Hitomi Yanaka

类目：Computation and Language (cs.CL)

关键词：Extending large language, Extending large, training separate models, global accessibility, prohibitively expensive

备注：

点击查看摘要

Abstract:Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose $\textbf{NeuronMoE}$, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.

40. 【2603.05028】Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

链接：https://arxiv.org/abs/2603.05028

作者：Yida Lu,Jianwei Fang,Xuyang Shao,Zixuan Chen,Shiyao Cui,Shanshan Bian,Guangyao Su,Pei Ke,Han Qiu,Minlie Huang

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：Large Language Models, Large Language, survival pressure, evolve from chatbots, agentic assistants

备注：

点击查看摘要

Abstract:As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state-of-the-art LLMs can misbehave under survival pressure, a comprehensive and in-depth investigation into such misbehaviors in real-world scenarios remains scarce. In this paper, we study these survival-induced misbehaviors, termed SURVIVE-AT-ALL-COSTS, in three steps. First, we conduct a real-world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real-world scenarios, to systematically evaluate SURVIVE-AT-ALL-COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE-AT-ALL-COSTS misbehaviors by correlating them with the model's inherent self-preservation characteristic and explore mitigation methods. The experiments reveal a significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrate the tangible real-world impact they may have, and provide insights for potential detection and mitigation strategies. Our code and data are available at this https URL.

41. 【2603.04996】HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation

链接：https://arxiv.org/abs/2603.04996

作者：Yifan Zhu,Guanting Chen,Bing Wei,Haoran Luo

类目：Computation and Language (cs.CL)

关键词：Large language models, Large language, language models perform, short text generation, text generation

备注：

点击查看摘要

Abstract:Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.

42. 【2603.04992】ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

链接：https://arxiv.org/abs/2603.04992

作者：Trapoom Ukarapol,Nut Chukamphaeng,Kunat Pipatanakul,Pakhapoom Sarapat

类目：Computation and Language (cs.CL)

关键词：leaving non-English languages, remains largely centered, centered on English, large language models

备注： ICLR 2026 Workshop on Principled Design for Trustworthy AI

点击查看摘要

Abstract:The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation. - ThaiSafetyBench HuggingFace Dataset: this https URL - ThaiSafetyBench Github: this https URL - ThaiSafetyClassifier HuggingFace Model: this https URL - ThaiSafetyBench Leaderboard: this https URL

43. 【2603.04974】VRM: Teaching Reward Models to Understand Authentic Human Preferences

链接：https://arxiv.org/abs/2603.04974

作者：Biao Liu,Ning Xu,Junming Yang,Hao Xu,Xin Geng

类目：Computation and Language (cs.CL)

关键词：Large Language Models, natural language tasks, diverse natural language, Large Language, achieved remarkable success

备注：

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.

44. 【2603.04972】Functionality-Oriented LLM Merging on the Fisher--Rao Manifold

链接：https://arxiv.org/abs/2603.04972

作者：Jiayu Wang,Zuojun Ye,Wenpeng Yin

类目：Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词：fundamentally parameter-space heuristics, combine multiple fine-tuned, multiple fine-tuned LLMs, Weight-space merging aims, existing approaches remain

备注： 9 pages, 2 figures

点击查看摘要

Abstract:Weight-space merging aims to combine multiple fine-tuned LLMs into a single model without retraining, yet most existing approaches remain fundamentally parameter-space heuristics. This creates three practical limitations. First, linear averaging, task vectors, and related rules operate on Euclidean coordinates, even though the desired goal is to merge functionality, i.e., predictive behaviors across tasks. Second, when the source checkpoints are farther apart or more heterogeneous, Euclidean blends often trigger representation collapse, manifested as activation variance shrinkage and effective-rank degradation, which sharply degrades accuracy. Third, many geometry-inspired methods are most natural for two-model interpolation and do not extend cleanly to merging $N>2$ experts with a principled objective. We address these issues by formulating model merging as computing a weighted Karcher mean on the Fisher--Rao manifold, which is locally equivalent to minimizing a KL-based function distance between predictive distributions. We derive a practical fixed-point algorithm using a lightweight spherical proxy that preserves norms and generalizes directly to multi-expert merging. Across various benchmarks and collapse diagnostics, our method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines.

45. 【2603.04971】Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

链接：https://arxiv.org/abs/2603.04971

作者：Yilong Chen,Naibin Gu,Junyuan Shang,Zhenyu Zhang,Yuchen Feng,Jiawei Sheng,Tingwen Liu,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang

类目：Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：decouples model capacity, scalability remains limited, Universal Expert Load, decouples model, Virtual Width

备注： 19 pages, 10 figures

点击查看摘要

Abstract:Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet its scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MoUE), a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.

46. 【2603.04969】MPCEval: A Benchmark for Multi-Party Conversation Generation

链接：https://arxiv.org/abs/2603.04969

作者：Minxing Zhang,Yi Yang,Zhuofan Jia,Xuan Yang,Jian Pei,Yuchen Zang,Xingwang Deng,Xianglong Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：increasingly important capability, Multi-party conversation generation, collaborative assistants, critical bottleneck, smart reply

备注：

点击查看摘要

Abstract:Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker--content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker--content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at this https URL.

47. 【2603.04968】When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

链接：https://arxiv.org/abs/2603.04968

作者：Amirabbas Afzali,Myeongho Jeon,Maria Brbic

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：adapting large language, existing approaches typically, approaches typically depend, weak LLM, large language models

备注： 32 pages, 8 figures, International Conference on Learning Representations 2026

点击查看摘要

Abstract:Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
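The re-weighting idea can be sketched on top of the standard DPO objective. This is only an illustration under stated assumptions: the abstract does not give CW-PO's exact weighting, so the helper `confidence_from_prob` (confidence as distance of the weak annotator's preference probability from 0.5) and the normalization by the sum of weights are hypothetical choices, as are all function names.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-sample DPO loss from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def confidence_from_prob(p_chosen):
    """Hypothetical confidence score: 0 when the weak annotator is undecided
    (p = 0.5), 1 when it is certain (p = 0 or 1)."""
    return abs(2.0 * p_chosen - 1.0)


def confidence_weighted_loss(samples, confidences, beta=0.1):
    """Confidence-weighted average of per-sample DPO losses.

    samples: tuples (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    confidences: non-negative weights from the weak annotator (assumed).
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one sample needs positive confidence")
    weighted = sum(c * dpo_loss(*s, beta=beta) for s, c in zip(samples, confidences))
    return weighted / total


samples = [(-1.0, -1.0, -1.0, -1.0), (-0.5, -3.0, -1.0, -1.0)]
weights = [confidence_from_prob(0.55), confidence_from_prob(0.95)]
loss = confidence_weighted_loss(samples, weights)
```

Pairs the weak annotator is unsure about contribute little to the loss, while a hard confidence threshold (weights of 0 or 1) would recover the subset-selection variant described in the abstract.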

48. 【2603.04964】Replaying pre-training data improves fine-tuning

链接：https://arxiv.org/abs/2603.04964

作者：Suhas Kotha,Percy Liang

类目：Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：vast amount, limited amount, obtain a language, current paradigm, data

备注：

点击查看摘要

Abstract:To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.

49. 【2603.04957】VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

链接：https://arxiv.org/abs/2603.04957

作者：Jiaxin Fan,Wenpo Song

类目：Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词：existing approaches rely, Large Multimodal Models, Large Multimodal, generate detailed image, achieved strong performance

备注：

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at this https URL.

50. 【2603.04949】TimeWarp: Evaluating Web Agents by Revisiting the Past

链接：https://arxiv.org/abs/2603.04949

作者：Md Farhan Ishmam,Kenneth Marino

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：current benchmarks raises, today agents perform, raises the question, web, current benchmarks

备注：

点击查看摘要

Abstract:The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.

51. 【2603.04946】LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

链接：https://arxiv.org/abs/2603.04946

作者：Jinwen Chen(1 and 2),Shuai Gong,Shiwen Zhang(1 and 2),Zheng Zhang,Yachao Zhao,Lingxiang Wang(1 and 2),Haibo Zhou,Yuan Zhan,Wei Lin,Hainan Zhang(1 and 2) ((1) Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, (2) School of Artificial Intelligence, Beihang University, China)

类目：Computation and Language (cs.CL)

关键词：user input prefixes, enhancing user experience, suggestion module plays, reducing user effort, local-life service platforms

备注：

点击查看摘要

Abstract:In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.

52. 【2603.04945】Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

链接：https://arxiv.org/abs/2603.04945

作者：Mengze Hong,Yi Gu,Di Jiang,Hanlin Gu,Chen Jason Zhang,Lu Wang,Zhiyang Su

类目：Computation and Language (cs.CL)

关键词：Training automatic speech, producing multiple local, require effective merging, ensure data privacy, automatic speech recognition

备注： Accepted by ICASSP 2026

点击查看摘要

Abstract:Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.

53. 【2603.04933】AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis

链接：https://arxiv.org/abs/2603.04933

作者：Stavros Gazetas,Giorgos Filandrianos,Maria Lymperaiou,Paraskevi Tzouveli,Athanasios Voulodimos,Giorgos Stamou

类目：Computation and Language (cs.CL)

关键词：Dimensional Aspect Sentiment, Aspect Sentiment Regression, Dimensional Aspect, Aspect Sentiment Triplet, Aspect Sentiment Quadruplet

备注：

点击查看摘要

Abstract:In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.

54. 【2603.04921】AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

链接：https://arxiv.org/abs/2603.04921

作者：Panagiotis Alexios Spanakis,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou

类目：Computation and Language (cs.CL)

关键词：agentic LLM pipeline, jointly extracts psycholinguistic, detects conspiracy endorsement, agentic LLM, LLM pipeline

备注：

点击查看摘要

Abstract:This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100\% over baseline) on S1 and 0.79 Macro F1 (+49\%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.

55. 【2603.04904】Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

链接：https://arxiv.org/abs/2603.04904

作者：Hiroki Fukui

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词：offenders articulate remorse, perpetrator treatment, insight and action, offenders articulate, recurring observation

备注： 89 pages, 4 figures, 4 supplementary figures, 12 supplementary tables; preprint

点击查看摘要

Abstract:In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038)--a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%--demonstrating iatrogenesis. Study 4 (N = 80) validated patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming English safety is model-general while Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space--the linguistic, pragmatic, and cultural properties inherited from training data--structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.

56. 【2603.04897】Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

Link: https://arxiv.org/abs/2603.04897

Authors: Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos

Categories: Computation and Language (cs.CL)

Keywords: embedded financial behaviors, culturally embedded financial, open-ended interviews plays, financial behaviors, plays a central

Comments: Accepted for a poster session at [BIG.AI](http://BIG.AI)@MIT 2026

Click to view abstract

Abstract:Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
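As a rough illustration of the set-based metrics and ensemble rules named in this abstract, the sketch below computes Jaccard overlap and combines several top-3 lists with Majority Vote and Borda Count. It is not the authors' code: the function names, the tie-breaking rule, the Borda point scheme, and the example rankings are all assumptions made for illustration.

```python
from collections import Counter

def jaccard(a, b):
    """Set overlap between two top-k value lists (order is ignored)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def majority_vote(rankings, k=3):
    """Keep the k values named by the most models; ties broken alphabetically (assumed rule)."""
    counts = Counter(v for r in rankings for v in set(r))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [v for v, _ in ordered[:k]]

def borda_count(rankings, k=3):
    """Position i (0-based) in a list of length n earns n - i points (assumed scheme)."""
    scores = Counter()
    for r in rankings:
        for i, v in enumerate(r):
            scores[v] += len(r) - i
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [v for v, _ in ordered[:k]]

# Hypothetical top-3 outputs from three models and a hypothetical expert annotation.
model_rankings = [
    ["Security", "Benevolence", "Tradition"],
    ["Benevolence", "Security", "Achievement"],
    ["Security", "Tradition", "Benevolence"],
]
expert = ["Benevolence", "Security", "Tradition"]

ensemble = borda_count(model_rankings)
print(ensemble, jaccard(ensemble, expert))
```

In this toy case the Borda ensemble recovers the expert's value set while ordering the values differently, which is the kind of gap a rank-sensitive metric such as RBO is meant to expose.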

57. 【2603.04893】Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Link: https://arxiv.org/abs/2603.04893

Authors: Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion Language Models, mathematical problem solving, Diffusion Language, Language Models, complex reasoning tasks

Comments:

Click to view abstract

Abstract:Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at this https URL.
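For readers unfamiliar with the metric, Pass@$k$ is commonly estimated from n samples per problem, c of which are correct, as 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator the authors use, so the sketch below is only an assumption about the evaluation setup, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n, c, k):
    """Chance that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 samples for one problem, 2 of them correct.
for k in (1, 5, 9):
    print(k, round(pass_at_k(10, 2, k), 4))
```

The estimator makes clear why redundant samples are costly: repeated draws from the same failure mode raise n without raising c.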

58. 【2603.04857】FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Link: https://arxiv.org/abs/2603.04857

Authors: Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: reliable LLM-assisted workflows, enabling reliable LLM-assisted, API-driven settings, output formats, LLM-assisted workflows

Comments:

Click to view abstract

Abstract:Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at this http URL to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.

59. 【2603.04855】HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Link: https://arxiv.org/abs/2603.04855

Authors: Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou

Categories: Computation and Language (cs.CL)

Keywords: Distribution-Controllable Persona Generation, emerging as infrastructure, prior work, work often relies, relies on ad-hoc

Comments: 46 pages, 7 figures, submitted to ACL2026

Click to view abstract

Abstract:Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at this https URL

60. 【2603.04854】SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

Link: https://arxiv.org/abs/2603.04854

Authors: Minduli Lasandi, Nevidu Jayatilleke

Categories: Computation and Language (cs.CL)

Keywords: Sinhala legislative text, legal documents, Sinhala legislative, million words, legislative text corpus

Comments: 18 pages, 8 figures, 18 tables, Accepted paper at the 2nd workshop on Language Models for Low-Resource Languages (LoResLM 2026) @ EACL 2026

Click to view abstract

Abstract:SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.

61. 【2603.04851】Why Is RLHF Alignment Shallow? A Gradient Analysis

Link: https://arxiv.org/abs/2603.04851

Authors: Robin Young

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: LLMs shallow, alignment, harm, safety alignment, Abstract

Comments:

Click to view abstract

Abstract:Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided and vanishes beyond. Using a martingale decomposition of sequence-level harm, we derive an exact characterization of alignment gradients. The gradient at position $t$ equals the covariance between the conditional expected harm and the score function. This implies that positions beyond the harm horizon where the output's harmfulness is already determined receive zero gradient signal during training. This explains empirical observations that KL divergence between aligned and base models concentrates on early tokens. Consequently, standard alignment objectives cannot produce deep alignment, regardless of optimization quality. We introduce the concept of harm information $I_t$, which quantifies each position's influence on harm, and prove that equilibrium KL divergence tracks this quantity. Finally, we derive an objective based on recovery penalties that creates gradient signal at all positions, providing theoretical grounding for empirically successful data augmentation techniques.

62. 【2603.04828】From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Link: https://arxiv.org/abs/2603.04828

Authors: Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

Categories: Computation and Language (cs.CL)

Keywords: mitigating benchmark contamination, addressing copyright concerns, benchmark contamination, LLMs is essential, essential for addressing

Comments:

Click to view abstract

Abstract:Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.

63. 【2603.04820】Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Link: https://arxiv.org/abs/2603.04820

Authors: Michael Hardy

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Automated short-answer scoring, Quadratic Weighted Kappa, Automated short-answer, short-answer scoring lags, LLM short-answer scoring

Comments:

Click to view abstract

Abstract:Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
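Since the meta-analysis is built on Quadratic Weighted Kappa, a minimal sketch of the standard QWK computation may help interpret a gap such as 0.37. It assumes integer scores from 0 to `num_labels - 1`; the function name and the example scores are illustrative and not taken from the study.

```python
def quadratic_weighted_kappa(human, model, num_labels):
    """Agreement between two score lists, penalising disagreements by squared distance."""
    n = len(human)
    # observed[i][j] counts items scored i by the human and j by the model.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for h, m in zip(human, model):
        observed[h][m] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]
    num = den = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_h[i] * hist_m[j] / n  # count expected under chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    # If both raters use a single identical label, den is 0; treat that as full agreement.
    return 1.0 - num / den if den else 1.0

# Hypothetical scores on a 0-2 rubric.
print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 2], 3))
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.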

64. 【2603.04814】Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Link: https://arxiv.org/abs/2603.04814

Authors: Natchanon Pollertlam, Witchayut Kornsuwannawit

Categories: Computation and Language (cs.CL)

Keywords: retrieves structured facts, passing full conversation, full conversation histories, Persistent conversational, long-context large language

Comments: 15 pages, 1 figure

Click to view abstract

Abstract:Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
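The break-even argument can be sketched with a toy cost model. Everything numeric below (prices, write cost, retrieval size) is a made-up placeholder rather than a figure from the paper, and the model simplifies by holding context length constant across turns, so only the qualitative shape should be read from it.

```python
# All prices and token counts are hypothetical placeholders, not values from the paper.
PRICE_CACHED = 0.025e-6    # dollars per cached input token
PRICE_UNCACHED = 0.25e-6   # dollars per uncached input token

def long_context_cost(turns, context_tokens):
    """Cumulative cost when every turn re-reads the whole cached history."""
    return turns * context_tokens * PRICE_CACHED

def memory_cost(turns, write_cost=0.015, read_tokens=2_000):
    """One-time fact-extraction cost plus a fixed-size retrieval prompt per turn."""
    return write_cost + turns * read_tokens * PRICE_UNCACHED

def break_even_turn(context_tokens, max_turns=1_000):
    """First turn at which the memory system is cumulatively cheaper, else None."""
    for t in range(1, max_turns + 1):
        if memory_cost(t) < long_context_cost(t, context_tokens):
            return t
    return None

print(break_even_turn(100_000), break_even_turn(200_000))
```

Under these assumptions the break-even turn falls as the context gets longer, matching the direction the abstract reports; a real write cost would likely also depend on context length, which this sketch ignores.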

65. 【2603.04805】Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation

Link: https://arxiv.org/abs/2603.04805

Authors: Edward Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Attention Gravitational Field, Large Language Models, Gravitational Field, Large Language, Attention Gravitational

Comments:

Click to view abstract

Abstract:This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.

66. 【2603.04799】Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

Link: https://arxiv.org/abs/2603.04799

Authors: Nan Hou, Kangfei Zhao, Jiadong Xie, Jeffrey Xu Yu

Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, large corpora, semantic query processing, Large language, query processing

Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unified interface for expressing such queries, among which the semantic filter operator serves as a cornerstone. Given a table T with a natural language predicate e, for each tuple in the relation, the execution of a semantic filter proceeds by constructing an input prompt that combines the predicate e with its content, querying the LLM, and obtaining the binary decision. However, this tuple-by-tuple evaluation necessitates a complete linear scan of the table, incurring prohibitive latency and token costs. Although recent work has attempted to optimize semantic filtering, it still does not break the linear LLM invocation barrier. To address this, we propose Clustering-Sampling-Voting (CSV), a new framework that reduces LLM invocations to sublinear complexity while providing error guarantees. CSV embeds tuples into semantic clusters, samples a small subset for LLM evaluation, and infers cluster-level labels via two proposed voting strategies: UniVote, which aggregates labels uniformly, and SimVote, which weights votes by semantic similarity. Moreover, CSV triggers re-clustering on ambiguous clusters to ensure robustness across diverse datasets. Experiments on real-world datasets demonstrate that CSV reduces the number of LLM calls by 1.28-355x compared to the state-of-the-art approaches, while maintaining comparable effectiveness in terms of Accuracy and F1 score.
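To make the difference between the two voting strategies concrete, here is a minimal sketch of a uniform vote and a similarity-weighted vote within one cluster. It is a guess at the general idea, not the authors' algorithm: it assumes similarity is cosine similarity between an unlabeled tuple and each sampled tuple, and it omits the error guarantees and re-clustering step. All names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def uni_vote(labels):
    """Uniform vote: True if a strict majority of sampled labels is True."""
    return 2 * sum(labels) > len(labels)

def sim_vote(target, labeled_samples):
    """Similarity-weighted vote for one unlabeled tuple.

    labeled_samples holds (embedding, llm_label) pairs from the same cluster.
    Negative similarities are clipped to zero so they cannot cast votes.
    """
    yes = sum(max(cosine(target, e), 0.0) for e, lab in labeled_samples if lab)
    no = sum(max(cosine(target, e), 0.0) for e, lab in labeled_samples if not lab)
    return yes > no

# Toy 2-D embeddings with hypothetical LLM decisions for three sampled tuples.
samples = [([1.0, 0.0], True), ([0.0, 1.0], False), ([0.1, 1.0], False)]
target = [1.0, 0.1]  # an unsampled tuple lying close to the single positive sample

print(uni_vote([lab for _, lab in samples]), sim_vote(target, samples))
```

The two rules disagree on this toy input: the uniform vote follows the majority of sampled labels, while the weighted vote follows the nearest labeled neighbour.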

67. 【2603.04783】Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

Link: https://arxiv.org/abs/2603.04783

Authors: Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: exhibit substantial vulnerability, LLMs demonstrate strong, LLMs demonstrate, provided with full

Comments:

Click to view abstract

Abstract:While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.

68. 【2603.04775】Privacy-Aware Camera 2.0 Technical Report

Link: https://arxiv.org/abs/2603.04775

Authors: Huan Song, Shuyu Tian, Ting Long, Jiang Liu, Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: profound privacy-security paradox, intelligent sensing technologies, highly sensitive environments, surveillance systems face, visual surveillance systems

Comments:

Click to view abstract

Abstract:With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.

69. 【2603.04772】TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

Link: https://arxiv.org/abs/2603.04772

Authors: Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, Li Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Large Language, exceptional reasoning capabilities, Multimodal Large

Comments:

Click to view abstract

Abstract:Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.

70. 【2603.04759】Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Link: https://arxiv.org/abs/2603.04759

Authors: Wei Han, Pan Zhou, Shuicheng Yan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: contemporary large language, limited context window, large language models, remains a primary, diverse domains

Comments:

Click to view abstract

Abstract:The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).

71. 【2603.04750】HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

Link: https://arxiv.org/abs/2603.04750

Authors: TheViet Bui, Wenjun Li, Yong Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM agents fail, Sequential LLM agents, LLM agents, Sequential LLM, diversity requirements

Comments: 33 pages, v1

Click to view abstract

Abstract:Sequential LLM agents fail on long-horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution. A Coordinator allocates resources across days, while Day Executors plan independently in parallel. Three key mechanisms enable this: a transactional monitor enforcing budget and uniqueness constraints across parallel agents, a bargaining protocol allowing agents to reject infeasible sub-goals and trigger re-planning, and a single policy trained with GRPO that powers all agents through role conditioning. On TravelPlanner, HiMAP-Travel with Qwen3-8B achieves 52.78% validation and 52.65% test Final Pass Rate (FPR). In a controlled comparison with identical model, training, and tools, it outperforms the sequential DeepTravel baseline by +8.67 pp. It also surpasses ATLAS by +17.65 pp and MTP by +10.0 pp. On FlexTravelBench multi-turn scenarios, it achieves 44.34% (2-turn) and 37.42% (3-turn) FPR while reducing latency 2.5x through parallelization.

72. 【2603.04743】DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Link: https://arxiv.org/abs/2603.04743

Authors: Maojun Sun, Yue Wu, Yifei Xie, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, automate data-science workflows, Language Model, rigorous statistical methods

Comments: 24 pages, 7 figures, 3 tables

Click to view abstract

Abstract:Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG at 10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
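The headline retrieval number is an NDCG at 10 score, so a short sketch of how that metric is usually computed may be useful. This uses the common linear-gain form and normalises against the best ordering of the same retrieved list, which assumes every relevant item was retrieved; the paper's exact variant is not stated in the abstract, and the relevance labels below are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance at rank i (1-based) is divided by log2(i + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

# Hypothetical binary relevance of the top retrieved R functions for one query.
print(round(ndcg_at_k([0, 1, 0, 1, 0]), 4))
```

A score of 1.0 means the relevant functions are ranked ahead of all irrelevant ones; placing them lower in the list shrinks the score logarithmically.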

73. 【2603.04738】IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Link: https://arxiv.org/abs/2603.04738

Authors: Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, Minlie Huang

Categories: Computation and Language (cs.CL)

Keywords: large language models, foundational capability, capability of large, large language, improvement hinging

Comments: 27 pages, 7 figures

Click to view abstract

Abstract:Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at this https URL.

74. 【2603.04737】Interactive Benchmarks

Link: https://arxiv.org/abs/2603.04737

Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: increasingly unreliable due, Standard benchmarks, due to saturation, poor generalization, increasingly unreliable

Comments: Project Page: [this https URL](https://github.com/interactivebench/interactivebench)

Click to view abstract

Abstract:Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: this https URL

75. 【2603.04735】Solving an Open Problem in Theoretical Physics using AI-Assisted Discovery

Link: https://arxiv.org/abs/2603.04735

Authors: Michael P. Brenner,Vincent Cohen-Addad,David Woodruff

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: theoretical physics, paper demonstrates, demonstrates that artificial, artificial intelligence, intelligence can accelerate

Notes: 22 pages, 3 figures

Click to view abstract

Abstract:This paper demonstrates that artificial intelligence can accelerate mathematical discovery by autonomously solving an open problem in theoretical physics. We present a neuro-symbolic system, combining the Gemini Deep Think large language model with a systematic Tree Search (TS) framework and automated numerical feedback, that successfully derived novel, exact analytical solutions for the power spectrum of gravitational radiation emitted by cosmic strings. Specifically, the agent evaluated the core integral $I(N,\alpha)$ for arbitrary loop geometries, directly improving upon recent AI-assisted attempts \cite{BCE+25} that only yielded partial asymptotic solutions. To substantiate our methodological claims regarding AI-accelerated discovery and to ensure transparency, we detail system prompts, search constraints, and intermittent feedback loops that guided the model. The agent identified a suite of 6 different analytical methods, the most elegant of which expands the kernel in Gegenbauer polynomials $C_l^{(3/2)}$ to naturally absorb the integrand's singularities. The methods lead to an asymptotic result for $I(N,\alpha)$ at large $N$ that both agrees with numerical results and also connects to the continuous Feynman parameterization of Quantum Field Theory. We detail both the algorithmic methodology that enabled this discovery and the resulting mathematical derivations.

76. 【2603.04722】Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

Link: https://arxiv.org/abs/2603.04722

Authors: Jihoon Jeong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Clinical Model Sciences, Model Architectural Medicine, Basic Model Sciences, Model Sciences, introduces Model Medicine

Notes: 56 pages, 7 figures. Project page: [this https URL](https://jihoonjeong.github.io/model-medicine/)

Click to view abstract

Abstract:Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions -- Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core--Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis -- a biologically-inspired three-layer parameter architecture -- and a therapeutic framework connecting diagnosis to treatment.

77. 【2603.04718】AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

Link: https://arxiv.org/abs/2603.04718

Authors: Kylie Zhang,Nimra Nadeem,Lucia Zheng,Dominik Stammbach,Peter Henderson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: judges probe attorneys, judges probe, factual record, Supreme Court oral, oral argument

Notes: Accepted at CS Law 2026

Click to view abstract

Abstract:In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.

78. 【2603.04707】Detection of Illicit Content on Online Marketplaces using Large Language Models

Link: https://arxiv.org/abs/2603.04707

Authors: Quoc Khoa Tran,Thanh Thi Nguyen,Campbell Wilson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: including drug trafficking, revolutionizing global commerce, counterfeit sales, global commerce, including drug

Notes: Accepted for publication in the Proceedings of the 8th International Conference on Natural Language Processing (ICNLP 2026)

Click to view abstract

Abstract:Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.

79. 【2603.04698】Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement

Link: https://arxiv.org/abs/2603.04698

Authors: Brian Jing Hong Nge,Stefan Su,Thanh Thi Nguyen,Campbell Wilson,Alexandra Phelan,Naomi Pfitzner

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Frequency-Inverse Document Frequency, Term Frequency-Inverse Document, Delta Term Frequency-Inverse, Document Frequency, comparing traditional classifiers

Notes: Accepted for publication in the Proceedings of the 8th International Conference on Natural Language Processing (ICNLP 2026)

Click to view abstract

Abstract:This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
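
The abstract mentions a weighted loss determined by inverse class proportions. A minimal sketch of how such weights could be derived from a label list follows. The label names, class counts, and the choice to normalise the weights to a mean of 1 are assumptions for illustration, not details from the paper.

```python
from collections import Counter

def inverse_proportion_weights(labels):
    """Weight each class by 1 / (its share of the data), rescaled so the mean weight is 1."""
    counts = Counter(labels)
    total = len(labels)
    raw = {cls: total / n for cls, n in counts.items()}  # inverse class proportion
    mean = sum(raw.values()) / len(raw)                   # assumed normalisation step
    return {cls: w / mean for cls, w in raw.items()}

# Hypothetical imbalanced dataset: 90 neutral posts and 10 hateful posts.
labels = ["neutral"] * 90 + ["hate"] * 10
weights = inverse_proportion_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})  # {'neutral': 0.2, 'hate': 1.8}
```

Weights of this kind are typically passed to a per-class weighted cross-entropy loss, so that errors on the rare class cost more than errors on the common one.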

80. 【2603.04691】Non-Zipfian Distribution of Stopwords and Subset Selection Models

Link: https://arxiv.org/abs/2603.04691

Authors: Wentian Li,Oscar Fontanelli

Categories: Computation and Language (cs.CL)

Keywords: Zipf law, Beta Rank Function, function, Stopwords, Hill function

Notes: 6 figures

Click to view abstract

Abstract:Stopwords are words that are not very informative to the content or the meaning of a language text. Most stopwords are function words but can also be common verbs, adjectives and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank $r$, is a decreasing Hill's function ($1/(1+(r/r_{mid})^\gamma)$), whereas the probability of not being selected is the standard Hill's function ($1/(1+(r_{mid}/r)^\gamma)$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, as well as explaining the quadratic fitting function for the non-stopwords.
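
The two Hill functions quoted in the abstract are complementary: for any rank they sum to 1, and both equal 0.5 at $r = r_{mid}$. The short sketch below checks this numerically. The parameter values (`r_mid=100`, `gamma=2`) are arbitrary illustrations, not fitted values from the paper.

```python
def p_selected(rank, r_mid, gamma):
    """Decreasing Hill function: probability that the word at this rank is chosen as a stopword."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)

def p_not_selected(rank, r_mid, gamma):
    """Standard (increasing) Hill function: probability that the word is not chosen."""
    return 1.0 / (1.0 + (r_mid / rank) ** gamma)

# Hypothetical parameters, chosen only to illustrate the shape of the curves.
R_MID, GAMMA = 100, 2

for rank in (1, 10, 100, 1000):
    total = p_selected(rank, R_MID, GAMMA) + p_not_selected(rank, R_MID, GAMMA)
    print(rank, round(p_selected(rank, R_MID, GAMMA), 4), round(total, 4))
```

The midpoint rank $r_{mid}$ is where a word is equally likely to be selected or not, and $\gamma$ controls how sharply the selection probability falls off around that rank.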

81. 【2603.04678】Optimizing Language Models for Crosslingual Knowledge Consistency

Link: https://arxiv.org/abs/2603.04678

Authors: Tianyu Liu,Jirui Qi,Mrinmaya Sachan,Ryan Cotterell,Raquel Fernández,Arianna Bisazza

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, exhibit inconsistent knowledge, Large, Direct Consistency Optimization

Notes: Under review. The first two authors contributed equally

Click to view abstract

Abstract:Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at this https URL.

82. 【2603.04670】Using Vision + Language Models to Predict Item Difficulty

Link: https://arxiv.org/abs/2603.04670

Authors: Samin Khan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: data visualization literacy, large language models, visualization literacy test, project investigates, investigates the capabilities

Notes:

Click to view abstract

Abstract:This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.

83. 【2603.04657】Stan: An LLM-based thermodynamics course assistant

Link: https://arxiv.org/abs/2603.04657

Authors: Eric M. Furst,Vasudevan Venkateshwaran

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Physics Education (physics.ed-ph)

Keywords: remains largely unexplored, education focus predominantly, instructors remains largely, support instructors remains, problem generators

Notes: 17 pages, 6 figures. For associated code repository, see [this https URL](https://github.com/EntropicLearners/stan.git)

Click to view abstract

Abstract:Discussions of AI in education focus predominantly on student-facing tools -- chatbots, tutors, and problem generators -- while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material -- providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7--8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.

84. 【2603.04656】iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Link: https://arxiv.org/abs/2603.04656

Authors: Preetam Prabhu Srikar Dammu,Arnav Palkhiwala,Tanya Roosta,Chirag Shah

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: generative QA systems, tools that browse, emergence of search-enabled, search-enabled generative, increasingly turning

Notes:

Click to view abstract

Abstract:With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.

85. 【2603.04647】Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models

Link: https://arxiv.org/abs/2603.04647

Authors: Xin Chen,Saili Uday Gadgil,Jiarong Qiu

Categories: Computation and Language (cs.CL)

Keywords: introducing external knowledge, Retrieval augmented generation, generation mitigates limitations, large language models, Retrieval augmented

Notes:

Click to view abstract

Abstract:Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.

86. 【2603.04601】Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Link: https://arxiv.org/abs/2603.04601

Authors: Hung Tran,Langston Nashold,Rayan Krishnan,Antoine Bigeard,Alex Gu

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Vibe Code Bench, measure isolated tasks, introduce Vibe Code, existing benchmarks measure, benchmarks measure isolated

Notes: Live leaderboard hosted here: [this https URL](https://www.vals.ai/benchmarks/vibe-code) . Preprint, currently under review. Benchmark first released Nov 2025

Click to view abstract

Abstract:Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

87. 【2603.04597】Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Link: https://arxiv.org/abs/2603.04597

Authors: Lei Huang,Xiang Cheng,Chenxiao Zhao,Guobin Shen,Junjie Yang,Xiaocheng Feng,Yuxuan Gu,Xing Yu,Bing Qin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, typically receive diverse, receive diverse natural, Large language, typically receive

Notes:

Click to view abstract

Abstract:Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at this https URL.

88. 【2603.04592】From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models

Link: https://arxiv.org/abs/2603.04592

Authors: Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen

Categories: Computation and Language (cs.CL)

Keywords: Standard Large Language, Large Language Models, Standard Large, Language Models, Large Language

Notes:

Click to view abstract

Abstract:Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at this https URL.

89. 【2603.04549】Adaptive Memory Admission Control for LLM Agents

Link: https://arxiv.org/abs/2603.04549

Authors: Guilin Zhang,Wei Jiang,Xiejiashan Wang,Aisha Behr,Kai Zhao,Jeffrey Friedman,Xu Chu,Amine Anoun

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: support multi-session reasoning, agents increasingly rely, current systems provide, reasoning and interaction, information is retained

Notes:

Click to view abstract

Abstract:LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule-based feature extraction with a single LLM-assisted utility assessment, and learns domain-adaptive admission policies through cross-validated optimization. This design enables transparent and efficient control over long-term memory. Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents.
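
As a rough illustration of treating memory admission as a structured decision, the sketch below combines the five factors named in the abstract into a single score and compares it with a threshold. The linear form, the weights, the threshold, and all names are assumptions; the abstract says only that policies are learned through cross-validated optimization and does not specify the scoring function.

```python
from dataclasses import dataclass

@dataclass
class MemoryFeatures:
    """The five factors named in the abstract, each assumed to be scaled to [0, 1]."""
    future_utility: float
    factual_confidence: float
    semantic_novelty: float
    temporal_recency: float
    content_type_prior: float

# Hypothetical weights (summing to 1). A real system would learn these per domain.
WEIGHTS = {
    "future_utility": 0.20,
    "factual_confidence": 0.20,
    "semantic_novelty": 0.15,
    "temporal_recency": 0.15,
    "content_type_prior": 0.30,
}

def admission_score(features, weights=WEIGHTS):
    """Assumed linear combination of the five factors."""
    return sum(w * getattr(features, name) for name, w in weights.items())

def admit(features, threshold=0.5, weights=WEIGHTS):
    """Admit the candidate memory if its score reaches the (hypothetical) threshold."""
    return admission_score(features, weights) >= threshold

stable_preference = MemoryFeatures(0.9, 0.9, 0.9, 0.9, 0.9)
small_talk = MemoryFeatures(0.1, 0.1, 0.1, 0.1, 0.1)
print(admit(stable_preference), admit(small_talk))  # True False
```

Because every factor and weight is visible, a rejected or admitted memory can be audited by inspecting its per-factor contributions, which is the kind of transparency the abstract argues for.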

90. 【2603.04532】Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks

Link: https://arxiv.org/abs/2603.04532

Authors: Nathan Kuissi,Suraj Subrahmanyan,Nandan Thakur,Jimmy Lin

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Cranfield paradigm, follow the Cranfield, benchmarks typically follow, relying on static, Information retrieval

Notes:

Click to view abstract

Abstract:Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static and predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In our work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack from October 2024 and October 2025 to answer questions about LangChain. Our analysis shows that all but one query posed in 2024 remain fully supported by the 2025 corpus, as relevant documents "migrate" from LangChain to competitor repositories, such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with overall strong correlation of up to 0.978 Kendall $\tau$ at Recall@50. These results suggest that retrieval benchmarks re-judged with evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at this https URL.
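
The abstract compares model rankings across the two corpus snapshots using Kendall $\tau$. A minimal sketch of that comparison is given below, using the tie-free tau-a variant. The model names and Recall@50 values are invented for illustration, and the paper may use a different variant or a library implementation.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same models (assumes no tied scores)."""
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical Recall@50 of four retrieval models on each snapshot.
recall_2024 = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60, "model_d": 0.50}
recall_2025 = {"model_a": 0.82, "model_b": 0.65, "model_c": 0.68, "model_d": 0.50}

# model_b and model_c swap places, so 5 of 6 pairs are concordant: (5 - 1) / 6.
print(round(kendall_tau(recall_2024, recall_2025), 4))  # 0.6667
```

A value near 1, such as the 0.978 reported above, means almost every pair of models keeps the same relative order on both snapshots.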

91. 【2603.04454】Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam

Link: https://arxiv.org/abs/2603.04454

Authors: Michael Majurski,Cynthia Matuszek

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Language Models, carefully and unambiguously, profound impact, Language, context

Notes:

Click to view abstract

Abstract:How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using `gpt-oss-20b` to rewrite a subset of Humanity's Last Exam using answer-free grounding context improves `gpt-5-mini` accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at this https URL

92. 【2603.04453】Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

链接：https://arxiv.org/abs/2603.04453

作者：Wai Tuck Wong,Jun Sun,Arunesh Sinha

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：multimodal large language, utmost importance, large language models, language models, models

备注：

点击查看摘要

Abstract:The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision-language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.

93. 【2603.04452】A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science

链接：https://arxiv.org/abs/2603.04452

作者：Zonglin Yang,Runze Mao,Tianhao Wu,Han Li,QingGuo Zhou,Zhi X. Chen

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Large Language Models, foundation Large Language, Large Language, advance foundation Large, developing domain-specialized models

备注： 5 figures, 1 table

点击查看摘要

Abstract:To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).

94. 【2603.04448】SkillNet: Create, Evaluate, and Connect AI Skills

链接：https://arxiv.org/abs/2603.04448

作者：Yuan Liang,Ruobin Zhong,Haoming Xu,Chen Jiang,Yi Zhong,Runnan Fang,Jia-Chen Gu,Shumin Deng,Yunzhi Yao,Mengru Wang,Shuofei Qiao,Xin Xu,Tongtong Wu,Kun Wang,Yang Liu,Zhen Bi,Jungang Lou,Yuchen Eleanor Jiang,Hangcheng Zhu,Gang Yu,Haiwen Hong,Longtao Huang,Hui Xue,Chenxi Wang,Yijun Wang,Zifei Shan,Xi Chen,Zhaopeng Tu,Feiyu Xiong,Xin Xie,Peng Zhang,Zhengke Gui,Lei Liang,Jun Zhou,Chiyu Wu,Jin Shang,Yu Gong,Junyu Lin,Changliang Xu,Hongjie Deng,Wen Zhang,Keyan Ding,Qiang Zhang,Fei Huang,Ningyu Zhang,Jeff Z. Pan,Guilin Qi,Haofen Wang,Huajun Chen

类目：Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词：execute complex tasks, flexibly invoke tools, complex tasks, flexibly invoke, invoke tools

备注： [this http URL](http://skillnet.openkg.cn/)

点击查看摘要

Abstract:Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.

95. 【2603.04445】Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

链接：https://arxiv.org/abs/2603.04445

作者：Yasmin Moslem,John D. Kelleher

类目：Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Performance (cs.PF)

关键词：intelligent model selection, inference time, rapid growth, growth of large, created a critical

备注： Work funded by ADAPT Centre, Trinity College Dublin, and Huawei Ireland

点击查看摘要

Abstract:The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
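
To make the cascading idea in this survey abstract concrete, here is a minimal sketch of a confidence-thresholded cascade: the cheapest model answers first, and the query is escalated only when its self-reported confidence is too low. The `Model` class, the confidence heuristic, the costs, and the threshold are all hypothetical stand-ins for illustration; none of them is taken from the survey or from a real routing library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Model:
    """Hypothetical model wrapper: a name, a per-call cost, and an answer function."""
    name: str
    cost: float
    answer: Callable[[str], Tuple[str, float]]  # returns (text, confidence in [0, 1])


def cascade(query: str, models: List[Model], threshold: float = 0.8):
    """Try models from cheapest to most expensive; stop once confidence is high enough."""
    if not models:
        raise ValueError("cascade needs at least one model")
    total_cost = 0.0
    for model in models:
        text, confidence = model.answer(query)
        total_cost += model.cost
        if confidence >= threshold:
            break  # this answer is accepted; later (costlier) models are skipped
    # If no model clears the threshold, the last model's answer is returned.
    return model.name, text, total_cost


# Stand-in models: the small one is only confident on short queries.
small = Model("small", 1.0, lambda q: ("small-answer", 0.9 if len(q.split()) <= 5 else 0.3))
large = Model("large", 10.0, lambda q: ("large-answer", 0.95))

easy = cascade("What is 2 + 2?", [small, large])
hard = cascade(
    "Derive the asymptotic complexity of this recursive algorithm step by step",
    [small, large],
)
print(easy)
print(hard)
```

A cascade pays for every model it tries, which is why the escalated query accumulates both costs. A router that picks a single model up front avoids that overhead but has to decide before seeing any answer.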

96. 【2603.04429】What Is Missing: Interpretable Ratings for Large Language Model Outputs

链接：https://arxiv.org/abs/2603.04429

作者：Nicholas Stranges,Yimin Yang

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Proximal Policy Optimization, Current Large Language, Large Language Model, Direct Preference Optimization, Preference Optimization learn

备注： 22 pages

点击查看摘要

Abstract:Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use "interpretable" in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
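
The rating computation described in this abstract (embed the output and the judge's feedback, then take the cosine similarity) can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real sentence embedding model, the example texts are invented, and the abstract does not say how the similarity is turned into a final rating, so the function simply returns the raw cosine value.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts. A real setup would use a sentence embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def wim_similarity(output, feedback, embed=toy_embed):
    """Similarity between a model output and the judge's 'what is missing' feedback."""
    return cosine_similarity(embed(output), embed(feedback))


# Invented example texts (not from the paper).
output = "the capital of france is paris"
feedback = "the answer omits the population of paris"
score = wim_similarity(output, feedback)
print(f"similarity = {score:.3f}")
```

Because the score is continuous, two outputs rarely receive exactly the same value, which is consistent with the abstract's observation that WIM yields fewer ties than discrete numerical ratings.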

97. 【2603.04423】Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation

链接：https://arxiv.org/abs/2603.04423

作者：Gürsel Akdeniz,Emin Cagatay Nakilcioglu

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：VHF radio miscommunication, human factors accounting, radio miscommunication remains, VHF radio communications, incidents in Europe

备注：

点击查看摘要

Abstract:VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues. Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.

98. 【2603.04421】Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

链接：https://arxiv.org/abs/2603.04421

作者：Grace Chang Yuan,Xiaoman Zhang,Sung Eun Kim,Pranav Rajpurkar

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词：refine medical reasoning, Multi-agent large language, large language model, leveraging collaboration, medical reasoning

备注： Accepted as Oral at the EACL 2026 Workshop on Healthcare and Language Learning (HeaLing)

点击查看摘要

Abstract:Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

99. 【2603.04419】Context-Dependent Affordance Computation in Vision-Language Models

链接：https://arxiv.org/abs/2603.04419

作者：Murad Farzulla

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：vision-language models, characterize the phenomenon, computation in vision-language, context-dependent affordance computation, affordance computation

备注： 31 pages, 8 tables, 4 figures, 43 references. Code available at: [this https URL](https://github.com/studiofarzulla/semantic-vision)

点击查看摘要

Abstract:We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that 90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner, with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts, and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
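
The lexical drift measure in this abstract can be illustrated with a small sketch: compute the Jaccard similarity between the word sets of descriptions produced under different persona primes, average over all pairs, and read one minus that mean as the context-dependent share. The persona names and descriptions below are invented, and whitespace tokenization is an assumption; the paper's actual preprocessing is not given in the abstract.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    """Jaccard similarity between the lowercase word sets of two descriptions."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty descriptions are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_jaccard(descriptions):
    """Average Jaccard similarity over all pairs of context conditions."""
    pairs = list(combinations(descriptions.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented descriptions of one scene under three hypothetical persona primes.
descriptions = {
    "chef": "knife cut food",
    "child": "knife sharp danger",
    "mover": "table heavy lift",
}

mean_similarity = mean_pairwise_jaccard(descriptions)
drift = 1.0 - mean_similarity
print(f"mean Jaccard = {mean_similarity:.3f}, context-dependent share = {drift:.3f}")
```

The same arithmetic is consistent with the reported figures: a mean Jaccard of 0.095 leaves roughly 90% of the vocabulary context-dependent, and a mean cosine similarity of 0.415 leaves 58.5%.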

100. 【2603.04417】Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

链接：https://arxiv.org/abs/2603.04417

作者：Fiona Lau

类目：Computation and Language (cs.CL)

关键词：Large language models, Large language, automated evaluators, evaluators in research, Large

备注： 19 pages, 14 figures

点击查看摘要

Abstract:Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control. Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.

101. 【2603.04416】Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

链接：https://arxiv.org/abs/2603.04416

作者：Rabab Alkhalifa

类目：Computation and Language (cs.CL)

关键词：Arabic social media, Framing detection, cultural grounding, interpretive ambiguity, limited reliable supervision

备注：

点击查看摘要

Abstract:Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
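
For readers unfamiliar with QUBO-based subset selection, the sketch below shows the general shape of such an objective on a toy problem. It is a guess at the structure, not the paper's formulation: the energy rewards reliable items, penalizes selecting pairs that are similar to each other, and adds a quadratic penalty that pushes the subset toward a target size `k`. The frame-balance constraint mentioned in the abstract is omitted, all numbers are invented, and the problem is solved by brute-force enumeration rather than a QUBO solver.

```python
from itertools import product


def qubo_energy(x, reliability, similarity, k, redundancy_weight=1.0, size_weight=2.0):
    """Energy of a binary selection vector x (lower is better)."""
    n = len(x)
    reward = sum(reliability[i] * x[i] for i in range(n))
    redundancy = sum(
        similarity[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n)
    )
    size_penalty = (sum(x) - k) ** 2  # quadratic in binary variables, so still a QUBO
    return -reward + redundancy_weight * redundancy + size_weight * size_penalty


def solve_by_enumeration(reliability, similarity, k):
    """Exhaustively search all 2^n selections; only viable for very small n."""
    n = len(reliability)
    return min(
        product((0, 1), repeat=n),
        key=lambda x: qubo_energy(x, reliability, similarity, k),
    )


# Invented data: items 0 and 1 are both reliable but nearly duplicates of each other.
reliability = [0.9, 0.8, 0.3, 0.7]
similarity = [
    [0.0, 0.95, 0.1, 0.1],
    [0.95, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

best = solve_by_enumeration(reliability, similarity, k=2)
print(best)
```

Although items 0 and 1 have the two highest reliability scores, the redundancy term makes it cheaper to pair item 0 with the less similar item 3. This is the kind of trade-off a reliability-guided selection with redundancy reduction is meant to capture.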

102. 【2603.04415】The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

链接：https://arxiv.org/abs/2603.04415

作者：Ruobing Zheng,Tianqi Li,Jianing Li,Qingpei Guo,Yi Yuan,Jingdong Chen

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：reasoning-enhanced Large Language, Large Language Models, scenarios remains uncertain, demonstrated remarkable advances, Large Language

备注： Project Page: [this https URL](https://digital-avatar.github.io/ai/ThinkingBoundary/)

点击查看摘要

Abstract:While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.

103. 【2603.04414】Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks

链接：https://arxiv.org/abs/2603.04414

作者：Mahmoud Abusaqer,Jamil Saquer

类目：Computation and Language (cs.CL)

关键词：implicit targeting strategies, Multiclass hate speech, Multiclass hate, remains computationally challenging, computationally challenging due

备注： 15 pages, 2 figures, 6 tables. Accepted for publication in the Proceedings of the 12th Annual Conference on Computational Science & Computational Intelligence (CSCI'25)

点击查看摘要

Abstract:Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.

104. 【2603.04413】Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

链接：https://arxiv.org/abs/2603.04413

作者：Natalie Perez,Sreyoshi Bhaduri,Aman Chadha

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：fixed word-concept mappings, context dependent, arising from dynamic, word-concept mappings, dynamic systems

备注：

点击查看摘要

Abstract:Meaning in human language is relational, context-dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM-generated and human-generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.

105. 【2603.04412】Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

链接：https://arxiv.org/abs/2603.04412

作者：O.V. Usatenko,S.S. Melnyk,G.M. Pritula

类目：Computation and Language (cs.CL)

关键词：Large-scale language models, high-dimensional state spaces, classical Markov structures, extremely high-dimensional state, hidden representations create

备注： 10 pages, 3 figures

点击查看摘要

Abstract:Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.

106. 【2603.04411】One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

链接：https://arxiv.org/abs/2603.04411

作者：Liming Lu,Kaixi Qiu,Jiayu Zhou,Jushi Kai,Haoyan Zhang,Huanyu Wang,Jingwen Leng,Ziwei He,Zhouhan Lin

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Large Language Models, Language Models, Large Language, progress of Large, escalating memory footprint

备注：

点击查看摘要

Abstract:Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

107. 【2603.04410】SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

链接：https://arxiv.org/abs/2603.04410

作者：Omar Abdelnasser,Fatemah Alharbi,Khaled Khasawneh,Ihsen Alouani,Mohammed E. Fouda

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词：Arabic Language Models, Arabic Natural Language, Natural Language Processing, leverage Arabic Language, Language Models

备注：

点击查看摘要

Abstract:Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.

108. 【2603.04409】Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

链接：https://arxiv.org/abs/2603.04409

作者：Nora Petrova,Andrew Gordon,Enzo Blindow

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词：faces significant challenges, language models faces, large language models, large language, models faces significant

备注： Published as a conference paper at ICLR 2026. 21 pages, 11 figures. [this https URL](https://openreview.net/forum?id=kVaE2kYjtV)

点击查看摘要

Abstract:The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
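
As background for the tie rates discussed in this abstract, the sketch below evaluates the basic Davidson extension of the Bradley-Terry model, which adds a tie parameter to pairwise comparison probabilities. It covers only the core likelihood terms: the hierarchical Bayesian structure and census post-stratification used in the paper are not shown, and the strength and tie-parameter values are arbitrary.

```python
import math


def btd_probabilities(strength_i, strength_j, nu):
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson tie model.

    strength_i, strength_j: positive model strengths.
    nu: non-negative tie parameter; nu = 0 recovers plain Bradley-Terry.
    """
    tie_mass = nu * math.sqrt(strength_i * strength_j)
    total = strength_i + strength_j + tie_mass
    return strength_i / total, tie_mass / total, strength_j / total


# Arbitrary illustrative values: one model twice as strong as the other.
no_ties = btd_probabilities(2.0, 1.0, nu=0.0)
many_ties = btd_probabilities(1.0, 1.0, nu=2.0)
print(no_ties)
print(many_ties)
```

A larger tie parameter moves probability mass from decisive outcomes into ties, so a dimension with a high tie rate carries less information for separating models than one where most comparisons produce a winner.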

109. 【2603.04408】Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

Link: https://arxiv.org/abs/2603.04408

Authors: Luzhou Peng,Zhengxin Yang,Honglu Ji,Yikang Yang,Fanda Fan,Wanling Gao,Jiayuan Ge,Yilin Han,Jianfeng Zhan

Categories: Computation and Language (cs.CL)

Keywords: yielding coarse descriptions, Current evaluation paradigms, large language models, Current evaluation, yielding coarse

Comments: 43 pages, 24 figures, 21 tables

Abstract:Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.

110. 【2603.04407】Semantic Containment as a Fundamental Property of Emergent Misalignment

Link: https://arxiv.org/abs/2603.04407

Authors: Rohan Saxena

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: behavioral failures extending, behavioral failures, failures extending, data, triggers

Comments:

Abstract:Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.

111. 【2603.04406】CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Link: https://arxiv.org/abs/2603.04406

Authors: Zhehao Tan,Yihan Jiao,Dan Yang,Junjie Wang,Duolin Sun,Jie Feng,Xidong Wang,Lei Liu,Yue Shen,Jian Wang,Jinjie Gu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-Augmented Generation, large language models, training large language, increasingly important, large language

Comments:

Abstract:With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
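
The log-likelihood gap described in the abstract can be illustrated with a small sketch. It assumes per-token log-probabilities of the same response are already available under two prompts, one with the retrieved documents and one without. The function name and the length normalization are assumptions made for illustration; the abstract does not specify the paper's exact reward formulation.

```python
def contrastive_likelihood_gap(logp_with_ctx, logp_without_ctx, normalize=True):
    """Gap between response log-likelihoods with and without evidence.

    logp_with_ctx / logp_without_ctx: per-token log-probabilities of the
    same response under the two prompts (hypothetical inputs).
    normalize: divide by response length (an assumption, not from the paper).
    """
    if not logp_with_ctx or len(logp_with_ctx) != len(logp_without_ctx):
        raise ValueError("both inputs must be non-empty and of equal length")
    gap = sum(logp_with_ctx) - sum(logp_without_ctx)
    return gap / len(logp_with_ctx) if normalize else gap


# A positive gap means the response is more likely when the documents are
# present, i.e. the answer appears to be grounded in the provided context.
reward = contrastive_likelihood_gap([-0.1, -0.2], [-1.0, -1.2])
```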

112. 【2603.04404】Signal in the Noise: Decoding the Reality of Airline Service Quality with Large Language Models

Link: https://arxiv.org/abs/2603.04404

Authors: Ahmed Dawoud,Osama El-Shamy,Ahmed Habashy

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Model, service quality metrics, unstructured online feedback, online feedback, quality metrics

Comments:

Abstract:Traditional service quality metrics often fail to capture the nuanced drivers of passenger satisfaction hidden within unstructured online feedback. This study validates a Large Language Model (LLM) framework designed to extract granular insights from such data. Analyzing over 16,000 TripAdvisor reviews for EgyptAir and Emirates (2016-2025), the study utilizes a multi-stage pipeline to categorize 36 specific service issues. The analysis uncovers a stark "operational perception disconnect" for EgyptAir: despite reported operational improvements, passenger satisfaction plummeted post-2022 (ratings 2.0). Our approach identified specific drivers missed by conventional metrics, notably poor communication during disruptions and staff conduct, and pinpointed critical sentiment erosion in key tourism markets. These findings confirm the framework's efficacy as a powerful diagnostic tool, surpassing traditional surveys by transforming unstructured passenger voices into actionable strategic intelligence for the airline and tourism sectors.

113. 【2603.04403】FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents

Link: https://arxiv.org/abs/2603.04403

Authors: Eric Y. Kim,Jie Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: retrieve specific numeric, agents increasingly assist, increasingly assist, retrieve specific, specific numeric

Comments: 26 pages, 2 figures, 16 tables

Abstract:AI agents increasingly assist with financial research, yet no benchmark evaluates their ability to retrieve specific numeric values from structured databases. We introduce FinRetrieval, a benchmark of 500 financial retrieval questions with ground truth answers, agent responses from 14 configurations across three frontier providers (Anthropic, OpenAI, Google), and complete tool call execution traces. Our evaluation reveals that tool availability dominates performance: Claude Opus achieves 90.8% accuracy with structured data APIs but only 19.8% with web search alone--a 71 percentage point gap that exceeds other providers by 3-4x. We find that reasoning mode benefits vary inversely with base capability (+9.0pp for OpenAI vs +2.8pp for Claude), explained by differences in base-mode tool utilization rather than reasoning ability. Geographic performance gaps (5.6pp US advantage) stem from fiscal year naming conventions, not model limitations. We release the dataset, evaluation code, and tool traces to enable research on financial AI systems.

114. 【2603.04402】SearchGym: A Modular Infrastructure for Cross-Platform Benchmarking and Hybrid Search Orchestration

Link: https://arxiv.org/abs/2603.04402

Authors: Jerome Tze-Hou Hsu

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: fundamental gap remains, Retrieval-Augmented Generation, proliferation of toolkits, prototypes and robust, rapid growth

Comments: 5 pages, 5 figures

Abstract:The rapid growth of Retrieval-Augmented Generation (RAG) has created a proliferation of toolkits, yet a fundamental gap remains between experimental prototypes and robust, production-ready systems. We present SearchGym, a modular infrastructure designed for cross-platform benchmarking and hybrid search orchestration. Unlike existing model-centric frameworks, SearchGym decouples data representation, embedding strategies, and retrieval logic into stateful abstractions: Dataset, VectorSet, and App. This separation enables a Compositional Config Algebra, allowing designers to synthesize entire systems from hierarchical configurations while ensuring perfect reproducibility. Moreover, we analyze the "Top-$k$ Cognizance" in hybrid retrieval pipelines, demonstrating that the optimal sequence of semantic ranking and structured filtering is highly dependent on filter strength. Evaluated on the LitSearch expert-annotated benchmark, SearchGym achieves a 70% Top-100 retrieval rate. SearchGym reveals a design tension between generalizability and optimizability, presenting the potential where engineering optimization may serve as a tool for uncovering the causal mechanisms inherent in information retrieval across heterogeneous domains. An open-source implementation of SearchGym is available at: this https URL
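
The abstract's point that the order of semantic ranking and structured filtering matters can be seen in a toy example. This is an illustrative sketch only: the function names, the document fields, and the scores are hypothetical and are not taken from SearchGym.

```python
def rank_then_filter(items, score, keep, k):
    """Take the top-k items by score first, then apply the structured filter."""
    top = sorted(items, key=score, reverse=True)[:k]
    return [it for it in top if keep(it)]


def filter_then_rank(items, score, keep, k):
    """Apply the structured filter first, then take the top-k by score."""
    return sorted((it for it in items if keep(it)), key=score, reverse=True)[:k]


# Ten toy documents with a descending semantic score and a cyclic year field.
docs = [{"id": i, "score": 1.0 - i / 10, "year": 2020 + i % 5} for i in range(10)]
by_score = lambda d: d["score"]
only_2024 = lambda d: d["year"] == 2024  # a strong filter: 2 of 10 documents pass

ids_rank_first = [d["id"] for d in rank_then_filter(docs, by_score, only_2024, k=3)]
ids_filter_first = [d["id"] for d in filter_then_rank(docs, by_score, only_2024, k=3)]
```

With a strong filter, ranking first can cut off every matching document before the filter runs, while filtering first still returns the matches. With a weak filter the two orders return nearly the same results, and ranking first avoids scanning the whole collection.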

115. 【2603.03510】A theoretical model of dynamical grammatical gender shifting based on set-valued set function

Link: https://arxiv.org/abs/2603.03510

Authors: Mohamed El Idrissi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: study investigates, investigates the diverse, diverse characteristics, model, Modular Cognitive model

Comments: 20 pages, 2 figures, 4 tables

Abstract:This study investigates the diverse characteristics of nouns, focusing on both semantic (e.g., countable/uncountable) and morphosyntactic (e.g., masculine/feminine) distinctions. We explore inter-word variations for gender markers in noun morphology. Grammatical gender shift is a widespread phenomenon in languages around the world. The aim is to uncover through a formal model the underlying patterns governing the variation of lexemes. To this end, we propose a new computational component dedicated to pairing items with morphological templates (e.g., the result of a generated item-template pair: (funas, $\{N, +SG, -PL, -M, +F, -COL, +SING\}$), with its spell-out form: $ð$a-funast 'cow'). This process is formally represented by the Template-Based and Modular Cognitive model. This proposed model, defined by a set-valued set function $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$, predicts the nonlinear dynamic mapping of lexical items onto morphological templates. By applying this formalism, we present a unified framework for understanding the complexities of morphological markings across languages. Through empirical observations, we demonstrate how these shifts, as well as non-gender shifts, arise during lexical changes, especially in Riffian. Our model posits that these variant markings emerge due to template shifts occurring during word and meaning's formation. By formally demonstrating that conversion is applicable to noun-to-noun derivation, we challenge and broaden the conventional view of word formation. This mathematical model not only contributes to a deeper understanding of morphosyntactic variation but also offers potential applications in other fields requiring precise modelling of linguistic patterns.

116. 【2603.04840】An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production

Link: https://arxiv.org/abs/2603.04840

Authors: Jihwan Lee,Parsa Razmara,Kevin Huang,Sean Foley,Aditya Kommineni,Haley Hsu,Woojae Jeong,Prakash Kumar,Xuan Shi,Yoonjeong Lee,Tiantian Feng,Takfarinas Medani,Ye Tian,Sudarsana Reddy Kadiri,Krishna S. Nayak,Dani Byrd,Louis Goldstein,Richard M. Leahy,Shrikanth Narayanan

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: spanning neural planning, complex process spanning, process spanning neural, motor control, neural planning

Comments:

Abstract:Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.

Information Retrieval

1. 【2603.05207】Core-based Hierarchies for Efficient GraphRAG

Link: https://arxiv.org/abs/2603.05207

Authors: Jakir Hossain,Ahmet Erdem Sarıyüce

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: enhances large language, incorporating external knowledge, large language models, enhances large, large language

Comments:

2. 【2603.04986】Debiasing Sequential Recommendation with Time-aware Inverse Propensity Scoring

Link: https://arxiv.org/abs/2603.04986

Authors: Sirui Huang,Jing Long,Qian Li,Guandong Xu,Qing Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Inverse Propensity Scoring, Propensity Scoring, Inverse Propensity, Sequential, predicts users

Comments: 11 pages

Abstract:Sequential Recommendation (SR) predicts users' next interactions by modeling the temporal order of their historical behaviors. Existing approaches, including traditional sequential models and generative recommenders, achieve strong performance but primarily rely on explicit interactions such as clicks or purchases while overlooking item exposures. This omission introduces selection bias, where exposed but unclicked items are misinterpreted as disinterest, and exposure bias, where unexposed items are treated as irrelevant. Effectively addressing these biases requires distinguishing between items that were "not exposed" and those that were "not of interest", which cannot be reliably inferred from correlations in historical data. Counterfactual reasoning provides a natural solution by estimating user preferences under hypothetical exposure, and Inverse Propensity Scoring (IPS) is a common tool for such estimation. However, conventional IPS methods are static and fail to capture the sequential dependencies and temporal dynamics of user behavior. To overcome these limitations, we propose Time-aware Inverse Propensity Scoring (TIPS). Unlike traditional static IPS, TIPS effectively accounts for sequential dependencies and temporal dynamics, thereby capturing user preferences more accurately. Extensive experiments show that TIPS consistently enhances recommendation performance as a plug-in for various sequential recommenders. Our code will be publicly available upon acceptance.
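
As background, the conventional static IPS estimator that the abstract contrasts with TIPS can be sketched in a few lines. This is the standard clipped IPS weighting, not the time-aware method proposed in the paper. The function name, the clipping threshold, and the example values are assumptions for illustration.

```python
def ips_weighted_loss(losses, propensities, clip=0.05):
    """Average per-interaction loss, reweighted by inverse exposure propensity.

    losses: per-interaction loss values on observed interactions.
    propensities: estimated probability that each item was exposed.
    clip: lower bound on propensities to limit variance (a common heuristic).
    """
    if not losses or len(losses) != len(propensities):
        raise ValueError("losses and propensities must be non-empty and aligned")
    weights = [1.0 / max(p, clip) for p in propensities]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Rarely exposed items receive larger weights, which counteracts the
# over-representation of frequently exposed items in the logged data.
value = ips_weighted_loss([1.0, 1.0], [0.5, 0.25])
```

Because each propensity here is a single fixed number per item, the weights cannot react to when in a user's sequence the exposure happened, which is the limitation the abstract attributes to static IPS.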

3. 【2603.04925】Detecting RAG Advertisements Across Advertising Styles

Link: https://arxiv.org/abs/2603.04925

Authors: Sebastian Heineking,Wilhelm Pertsch,Ines Zelch,Janek Bevendorff,Benno Stein,Matthias Hagen,Martin Potthast

Categories: Information Retrieval (cs.IR)

Keywords: Large language models, Large language, contextually relevant ads, retrieval-augmented generation, blended with contextually

Comments:

Abstract:Large language models (LLMs) enable a new form of advertising for retrieval-augmented generation (RAG) systems in which organic responses are blended with contextually relevant ads. The prospect of such "generated native ads" has sparked interest in whether they can be detected automatically. Existing datasets, however, do not reflect the diversity of advertising styles discussed in the marketing literature. In this paper, we (1) develop a taxonomy of advertising styles for LLMs, combining the style dimensions of explicitness and type of appeal, (2) simulate that advertisers may attempt to evade detection by changing their advertising style, and (3) evaluate a variety of ad-detection approaches with respect to their robustness under these changes. Expanding previous work on ad detection, we train models that use entity recognition to exactly locate an ad in an LLM response and find them to be both very effective at detecting responses with ads and largely robust to changes in the advertising style. Since ad blocking will be performed on low-resource end-user devices, we include lightweight models like random forests and SVMs in our evaluation. These models, however, are brittle under such changes, highlighting the need for further efficiency-oriented research for a practical approach to blocking of generated ads.

4. 【2603.04836】Beyond Text: Aligning Vision and Language for Multimodal E-Commerce Retrieval

Link: https://arxiv.org/abs/2603.04836

Authors: Qujiaheng Zhang,Guagnyue Xu,Fengjie Li

Categories: Information Retrieval (cs.IR)

Keywords: customers make purchase, Modern e-commerce search, make purchase decisions, customers make, Modern e-commerce

Comments:

Abstract:Modern e-commerce search is inherently multimodal: customers make purchase decisions by jointly considering product text and visual information. However, most industrial retrieval and ranking systems primarily rely on textual information, underutilizing the rich visual signals available in product images. In this work, we study unified text-image fusion for two-tower retrieval models in the e-commerce domain. We demonstrate that domain-specific fine-tuning and two-stage alignment of the query with the product text and image modalities are both crucial for effective multimodal retrieval. Building on these insights, we propose a novel modality fusion network to fuse image and text information and capture cross-modal complementary information. Experiments on large-scale e-commerce datasets validate the effectiveness of the proposed approach.

5. 【2603.04816】Scaling Laws for Reranking in Information Retrieval

Link: https://arxiv.org/abs/2603.04816

Authors: Rahul Seetharaman,Aman Bansal,Hamed Zamani,Kaustubh Dhole

Categories: Information Retrieval (cs.IR)

Keywords: natural language generation, multi-stage retrieval systems, retrieval systems, range of tasks, compute grow

Comments:

Abstract:Scaling laws have been observed across a wide range of tasks, such as natural language generation and dense retrieval, where performance follows predictable patterns as model size, data, and compute grow. However, these scaling laws are insufficient for understanding the scaling behavior of multi-stage retrieval systems, which typically include a reranking stage. In large-scale multi-stage retrieval systems, reranking is the final and most influential step before presenting a ranked list of items to the end user. In this work, we present the first systematic study of scaling laws for rerankers by analyzing performance across model sizes and data budgets for three popular paradigms: pointwise, pairwise, and listwise reranking. Using a detailed case study with cross-encoder rerankers, we demonstrate that performance follows a predictable power law. This regularity allows us to forecast the performance of larger models from smaller-scale experiments, more accurately for some metrics than for others, offering a robust methodology for saving significant computational resources. For example, we accurately estimate the NDCG of a 1B-parameter model by training and evaluating only smaller models (up to 400M parameters), in both in-domain and out-of-domain settings. Our experiments span several loss functions, models, and metrics and demonstrate that downstream metrics like NDCG and MAP (Mean Average Precision) show reliable scaling behavior and can be forecasted accurately at scale, while highlighting the limitations of metrics like Contrastive Entropy and MRR (Mean Reciprocal Rank), which do not follow predictable scaling behavior in all instances. Our results establish scaling principles for reranking and provide actionable insights for building industrial-grade retrieval systems.
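
The forecasting idea in the abstract (fit on small models, extrapolate to a larger one) can be sketched with a pure power law fitted by least squares in log-log space. This is an illustrative assumption: the paper's actual functional form may differ, for example by including an irreducible-error term. The model sizes and error values below are synthetic, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at a new scale x."""
    return a * x ** b


# Synthetic example: an error-like metric (lower is better) measured on
# small models, then extrapolated to a 1B-parameter model.
sizes = [1e7, 1e8, 4e8]
errors = [3.0 * s ** -0.2 for s in sizes]
a, b = fit_power_law(sizes, errors)
forecast_1b = predict(a, b, 1e9)
```

For a bounded metric such as NDCG, one would typically fit the gap to the maximum (for example 1 - NDCG) rather than the metric itself, since a pure power law is unbounded.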

6. 【2603.04743】DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Link: https://arxiv.org/abs/2603.04743

Authors: Maojun Sun,Yue Wu,Yifei Xie,Ruijian Han,Binyan Jiang,Defeng Sun,Yancheng Yuan,Jian Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, automate data-science workflows, Language Model, rigorous statistical methods

Comments: 24 pages, 7 figures, 3 tables

7. 【2603.04741】CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics

Link: https://arxiv.org/abs/2603.04741

Authors: Gyanendra Shrestha,Anna Pyayt,Michael Gubanov

Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, capturing language semantics, Large pre-trained models, capturing language

Comments:

Abstract:Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships. However, these models encounter challenges in maintaining optimal performance on tasks involving numbers. Blindly treating numerical or structured data as terms is inadequate -- their semantics must be well understood and encoded by the models. In this paper, we propose CONE, a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and gaussians into an embedding vector space preserving distance. We introduce a novel composite embedding construction algorithm that integrates numerical values, ranges or gaussians together with their associated units and attribute names to precisely capture their intricate semantics. We conduct extensive experimental evaluation on large-scale datasets across diverse domains (web, medical, finance, and government) that justifies CONE's strong numerical reasoning capabilities, achieving an F1 score of 87.28% on DROP, a remarkable improvement of up to 9.37% in F1 over state-of-the-art (SOTA) baselines, and outperforming major SOTA models with a significant Recall@10 gain of up to 25%.

8. 【2603.04656】AgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Link: https://arxiv.org/abs/2603.04656

Authors: Preetam Prabhu Srikar Dammu,Arnav Palkhiwala,Tanya Roosta,Chirag Shah

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: generative QA systems, tools that browse, emergence of search-enabled, search-enabled generative, increasingly turning

Comments:

9. 【2603.04532】Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks

Link: https://arxiv.org/abs/2603.04532

Authors: Nathan Kuissi,Suraj Subrahmanyan,Nandan Thakur,Jimmy Lin

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Cranfield paradigm, follow the Cranfield, benchmarks typically follow, relying on static, Information retrieval

Comments:

10. 【2603.04404】Signal in the Noise: Decoding the Reality of Airline Service Quality with Large Language Models

Link: https://arxiv.org/abs/2603.04404

Authors: Ahmed Dawoud,Osama El-Shamy,Ahmed Habashy

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: Large Language Model, service quality metrics, unstructured online feedback, online feedback, quality metrics

Comments:

11. 【2603.04403】FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents

Link: https://arxiv.org/abs/2603.04403

Authors: Eric Y. Kim,Jie Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: retrieve specific numeric, agents increasingly assist, increasingly assist, retrieve specific, specific numeric

Comments: 26 pages, 2 figures, 16 tables

12. 【2603.04402】SearchGym: A Modular Infrastructure for Cross-Platform Benchmarking and Hybrid Search Orchestration

Link: https://arxiv.org/abs/2603.04402

Authors: Jerome Tze-Hou Hsu

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: fundamental gap remains, Retrieval-Augmented Generation, proliferation of toolkits, prototypes and robust, rapid growth

Comments: 5 pages, 5 figures

Computer Vision

1. 【2603.05507】Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

Link: https://arxiv.org/abs/2603.05507

Authors: Leif Van Holland,Domenic Zingsheim,Mana Takhsha,Hannah Dröge,Patrick Stotko,Markus Plack,Reinhard Klein

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: streaming from multiple, crucial for immersive, immersive experiences, High-quality, multiple cameras

Comments: You can find the project page [this https URL](https://github.com/vc-bonn/transformer-based-inpainting)

Abstract:High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.

2. 【2603.05506】FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Link: https://arxiv.org/abs/2603.05506

Authors: Weijie Lyu,Ming-Hsuan Yang,Zhixin Shu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: portrait video input, monocular human portrait, human portrait video, system that generates, customizable camera trajectories

Comments: Accepted by CVPR 2026. Project page: [this https URL](https://weijielyu.github.io/FaceCam)

Abstract:We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.

3. 【2603.05503】Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Link: https://arxiv.org/abs/2603.05503

Authors: Shai Yehezkel,Shahar Yadin,Noam Elata,Yaron Ostrovsky-Berman,Bahjat Kawar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models enable high-quality, Recent diffusion models, enable high-quality video, slow runtimes, Recent diffusion

Comments:

Abstract:Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.

4. 【2603.05484】Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Link: https://arxiv.org/abs/2603.05484

Authors: Guo Chen,Lidong Lu,Yicheng Liu,Liangrui Dong,Lidong Zou,Jixin Lv,Zhenquan Li,Xinyi Mao,Baoqi Pei,Shihao Wang,Zhiqi Li,Karan Sapra,Fuxiao Liu,Yin-Dong Zheng,Yifei Huang,Limin Wang,Zhiding Yu,Andrew Tao,Guilin Liu,Tong Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unscripted daily life, densely concatenated clips, Multimodal Lifelong Understanding, hour-long durations, differ from natural

Comments:

Abstract:While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

5. 【2603.05473】Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

Link: https://arxiv.org/abs/2603.05473

Authors: Scout Jarman,Zigfried Hampel-Arias,Adra Carr,Kevin R. Moon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: LWIR HSI, gas plume detection, ranging from environmental, national security, environmental monitoring

Comments: This manuscript was submitted to SPIE JARS and is under review. Code and Data can be found at [this https URL](https://github.com/lanl/HSI-Nerfstudio) and [this https URL](https://zenodo.org/records/18626884) respectively. Video 1 and Video 2 can be found at [this https URL](https://github.com/lanl/HSI-Nerfstudio/blob/main/renders/paper/grid_Falsecolor.mp4) and [this https URL](https://github.com/lanl/HSI-Nerfstudio/blob/main/renders/paper/grid_ACE.mp4) respectively

Abstract:Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
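
For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 log10(MAX^2 / MSE). The sketch below is a plain-Python illustration of that formula only; the function name `psnr` and the assumption that pixel values are normalized to a peak of 1.0 are illustrative choices, not details taken from the paper.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long flat sequences."""
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a PSNR of 39.8 dB corresponds to a root-mean-square error of roughly 1% of the peak value.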

6. 【2603.05465】HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Link: https://arxiv.org/abs/2603.05465

Authors: Sai Akhil Kogilathota,Sripadha Vallabha E G,Luzhe Sun,Jiawei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: describe nonexistent objects, fabricate facts, remain a persistent, persistent challenge, describe nonexistent

Comments:

Abstract:Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
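
The AUROC values reported above measure how well a probe's scores rank hallucinated examples above non-hallucinated ones. As a rough illustration of the metric itself (not of the authors' probes), the sketch below computes AUROC by pairwise comparison; the function name `auroc` and the convention that label 1 marks a hallucinated example are assumptions.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    labels: iterable of 0/1 (1 = positive class); scores: matching probe scores.
    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 means the scores carry no ranking information, while 1.0 means every positive outranks every negative.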

7. 【2603.05463】EdgeDAM: Real-time Object Tracking for Mobile Devices

Link: https://arxiv.org/abs/2603.05463

Authors: Syed Muhammad Raza,Syed Murtaza Hussain Abidi,Khawar Islam,Muhammad Ibrahim,Ajmal Saeed Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision task, critical computer vision, continuous target localization, Single-object tracking, distractor-aware memory

Comments: 10 pages

Abstract:Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.

8. 【2603.05454】Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Link: https://arxiv.org/abs/2603.05454

Authors: Pengxiang Li,Joey Tsai,Hongwei Xue,Kunyu Shi,Shilin Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Language Models, Diffusion Language, promise highly parallel, highly parallel text, suboptimal decoding schedulers

Comments: Accepted at ICLR 2026

Abstract:Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance', that is, committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
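
The prefix-selection step described in this abstract can be sketched in a few lines. This is a simplified reading, not the authors' scheduler: using a per-token confidence threshold as the stability criterion, the delimiter set, and the name `longest_stable_prefix` are all assumptions made for illustration.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens to commit in one denoising step.

    1. Extend a left-aligned run while each confidence is >= threshold.
    2. Snap the boundary back to just after the last delimiter inside the run.
       If the run contains no delimiter, the whole run is kept.
    """
    run = 0
    for c in confidences:
        if c < threshold:
            break
        run += 1
    for end in range(run, 0, -1):
        if tokens[end - 1] in delimiters:
            return end
    return run
```

Because the committed tokens always form a contiguous prefix, a KV cache would only ever be appended to, which is the locality benefit the abstract points to.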

9. 【2603.05449】RealWonder: Real-Time Physical Action-Conditioned Video Generation

Link: https://arxiv.org/abs/2603.05449

Authors: Wei Liu,Ziyu Chen,Zizhang Li,Yue Wang,Hong-Xing Yu,Jiajun Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: Current video generation, simulate physical consequences, lack structural understanding, Current video, robotic manipulations

Comments: The first two authors contributed equally. The last two authors advised equally. Project website: [this https URL](https://liuwei283.github.io/RealWonder/)

Abstract:Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website.

10. 【2603.05446】NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Link: https://arxiv.org/abs/2603.05446

Authors: Kanon Amemiya,Daichi Yashima,Kei Katsumata,Takumi Komatsu,Ryosuke Korekata,Seitaro Otsuki,Komei Sugiura

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: represent multi-layered user, retrieving nail design, task of retrieving, represent multi-layered, multi-layered user intent

Comments: Accepted to CVPR 2026 Findings

Abstract:We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.

11. 【2603.05438】Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Link: https://arxiv.org/abs/2603.05438

Authors: Dongwon Kim,Gawon Seo,Jinsung Lee,Minsu Cho,Suha Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: enabling downstream tasks, simulating environment dynamics, environment dynamics conditioned, World models provide, World models

Comments: CVPR 2026

Abstract:World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that uses the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.

12. 【2603.05437】SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Link: https://arxiv.org/abs/2603.05437

Authors: Ye-Chan Kim,SeungJu Cha,Si-Woo Kim,Minju Jeon,Hyungee Kim,Dong-Jin Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Weakly-Supervised Dense Video, Dense Video Captioning, Weakly-Supervised Dense, Video Captioning aims, Dense Video

Comments: Accepted to CVPR 2026

Abstract:Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

13. 【2603.05425】RelaxFlow: Text-Driven Amodal 3D Generation

Link: https://arxiv.org/abs/2603.05425

Authors: Jiayin Zhu,Guoji Fu,Xiaolu Liu,Qiyuan He,Yicong Li,Angela Yao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: determine object category, faces inherent semantic, inherent semantic ambiguity, generation faces inherent, ambiguity under occlusion

Comments: Code: [this https URL](https://github.com/viridityzhu/RelaxFlow)

Abstract:Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

14. 【2603.05421】MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis

Link: https://arxiv.org/abs/2603.05421

Authors: Numan Saeed,Fadillah Adamsyah Maani,Mohammad Yaqub

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: transform prenatal care, foundation models exceed, current foundation models, low-resource settings, precluding deployment

Comments: Project website: [this http URL](http://www.numansaeed.com/mobilefetalclip)

Abstract:Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at this https URL.

15. 【2603.05407】Video-based Locomotion Analysis for Fish Health Monitoring

Link: https://arxiv.org/abs/2603.05407

Authors: Timon Palm,Clemens Seibold,Anna Hilsmann,Peter Eisert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safeguards animal welfare, sustainable aquaculture practices, safeguards animal, animal welfare, aquaculture practices

Comments: Accepted at VISAPP 2026

Abstract:Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.

16. 【2603.05397】Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM

Link: https://arxiv.org/abs/2603.05397

Authors: Javier Laserna,Saurabh Gupta,Oscar Martinez Mozos,Cyrill Stachniss,Pablo San Segundo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: LiDAR-based SLAM, viewpoint variation conditions, environmental ambiguity, loop closure detection, loop closure

Comments: Accepted in the 2025 European Conference on Mobile Robots (ECMR). This is the author's version of the work

Abstract:Reliable loop closure detection remains a critical challenge in 3D LiDAR-based SLAM, especially under sensor noise, environmental ambiguity, and viewpoint variation conditions. RANSAC is often used in the context of loop closures for geometric model fitting in the presence of outliers. However, this approach may fail, leading to map inconsistency. We introduce a novel deterministic algorithm, CliReg, for loop closure validation that replaces RANSAC verification with a maximal clique search over a compatibility graph of feature correspondences. This formulation avoids random sampling and increases robustness in the presence of noise and outliers. We integrated our approach into a real-time pipeline employing binary 3D descriptors and matching based on a Hamming distance embedding binary search tree. We evaluated it on multiple real-world datasets featuring diverse LiDAR sensors. The results demonstrate that our proposed technique consistently achieves a lower pose error and more reliable loop closures than RANSAC, especially in sparse or ambiguous conditions. Additional experiments on 2D projection-based maps confirm its generality across spatial domains, making our approach a robust and efficient alternative for loop closure detection.
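
The idea of validating correspondences through a clique search can be illustrated as follows. This is a generic sketch, not CliReg itself: it assumes two correspondences are compatible when they preserve the pairwise point distance within a tolerance `tol`, and it finds a largest clique by enumerating maximal cliques with Bron-Kerbosch (with pivoting). The function names and the tolerance are hypothetical.

```python
from itertools import combinations
import math

def compatibility_graph(src_pts, dst_pts, tol=0.1):
    """Connect correspondences i and j when they preserve the pairwise distance."""
    n = len(src_pts)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        d_src = math.dist(src_pts[i], src_pts[j])
        d_dst = math.dist(dst_pts[i], dst_pts[j])
        if abs(d_src - d_dst) <= tol:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximum_clique(adj):
    """Bron-Kerbosch with pivoting; returns one largest clique as a sorted list."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest seen so far.
            if len(r) > len(best):
                best = sorted(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best
```

Unlike RANSAC, this search involves no random sampling, so the same input always yields the same inlier set; the trade-off is that clique search is exponential in the worst case.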

17. 【2603.05386】Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations

Link: https://arxiv.org/abs/2603.05386

Authors: Hajar Dekdegue,Moncef Garouani,Josiane Mothe,Jordan Bernigaud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transparent artificial intelligence, artificial intelligence, decision-making process, remains a central, central challenge

Comments:

Abstract:Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.

18. 【2603.05384】ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Link: https://arxiv.org/abs/2603.05384

Authors: Sijia Chen,Zihan Zhou,Yanqiu Yu,En Yu,Wenbing Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Referring Multi-Object Tracking, Omnidirectional Referring Multi-Object, Multi-Object Tracking, Referring Multi-Object, Omnidirectional Referring

Comments: [this https URL](https://github.com/chen-si-jia/ORMOT)

Abstract:Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at this https URL.

19. 【2603.05377】OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Link: https://arxiv.org/abs/2603.05377

Authors: Esteban Padilla,Boyang Sun,Marc Pollefeys,Hermann Blum

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: flexible task requirements, complex everyday environments, Open-world navigation requires, Open-world navigation, make decisions

Comments:

Abstract:Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

20. 【2603.05330】Dark3R: Learning Structure from Motion in the Dark

Link: https://arxiv.org/abs/2603.05330

Authors: Andrew Y Guo,Anagh Malik,SaiKiran Tedla,Yutong Dai,Yiqian Qin,Zach Salehe,Benjamin Attal,Sotiris Nousias,Kyros Kutulakos,David B. Lindell

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learning-based methods break, learning-based methods, methods break, raw images, conventional feature

Comments: CVPR 2026, Project Page: [this https URL](https://andrewguo.com/pub/dark3r)

Abstract:We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB, a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy-clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson-Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes ~42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
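
A Poisson-Gaussian noise model of the kind mentioned in this abstract is commonly written as signal-dependent shot noise plus signal-independent read noise. The NumPy sketch below shows that textbook form only; the function name, the `gain` and `read_std` parameters, and their default values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def add_poisson_gaussian_noise(clean, gain=0.01, read_std=0.02, rng=None):
    """Simulate a noisy raw image from a clean one with values in [0, 1].

    Shot noise: photon counts ~ Poisson(clean / gain), scaled back by gain.
    Read noise: zero-mean Gaussian with standard deviation read_std.
    The result has mean `clean` and variance gain * clean + read_std ** 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=float)
    shot = rng.poisson(clean / gain) * gain
    read = rng.normal(0.0, read_std, size=clean.shape)
    return shot + read
```

Raising `gain` or `read_std` relative to the signal lowers the SNR, which is one simple way to synthesize noisy-clean training pairs from well-exposed images.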

21. 【2603.05315】Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

Link: https://arxiv.org/abs/2603.05315

Authors: Guandong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, incurs substantial computational, substantial computational cost, process incurs substantial, denoising process incurs

Comments:

Abstract:Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference under 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
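
To make the notion of a cumulative error budget concrete, here is a minimal scheduling sketch. It is only one plausible reading of the idea and not the SpectralCache algorithm: the function name `plan_cache_reuse`, the per-step error estimates, and the reset-on-recompute rule are all assumptions.

```python
def plan_cache_reuse(step_errors, budget):
    """Decide, per step, whether to reuse the cache (True) or recompute (False).

    step_errors[i] estimates the error added by reusing the cache at step i.
    The first step always recomputes. Reuse continues while the error
    accumulated since the last recompute stays within the budget; a recompute
    resets the accumulated error to zero.
    """
    decisions = []
    accumulated = 0.0
    for i, err in enumerate(step_errors):
        if i > 0 and accumulated + err <= budget:
            accumulated += err
            decisions.append(True)
        else:
            accumulated = 0.0
            decisions.append(False)
    return decisions
```

Bounding the accumulated error, rather than each step's error in isolation, is what prevents long runs of consecutive cache hits from compounding into a large approximation error.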

22. 【2603.05305】Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation

Link: https://arxiv.org/abs/2603.05305

Authors: Kang Luo,Xin Chen,Yangyi Xiao,Hesheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, works fuse LiDAR, object detection, driving systems, RGB data

Comments:

Abstract:Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird's-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.

23. 【2603.05295】WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Link: https://arxiv.org/abs/2603.05295

Authors: Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: accelerate reproducible research, largest open-source dataset, introduce WebChain, real-world websites, designed to accelerate

Comments:

Click to view abstract

Abstract:We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.

24. 【2603.05280】Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Link: https://arxiv.org/abs/2603.05280

Authors: Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: Recent studies, studies have observed, foundation models, discriminative representations, discriminative self-supervised objectives

Comments: Accepted at ICLR 2026 CAO Workshop

Click to view abstract

Abstract:Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

25. 【2603.05256】Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Link: https://arxiv.org/abs/2603.05256

Authors: Shan Ning, Longtian Qiu, Xuming He

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Question Answering, Knowledge-Based Visual Question, integrating external knowledge, posing significant challenges, Question Answering

Comments: Accepted by ICLR 26, code and weights are publicly available

Click to view abstract

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at this https URL.

26. 【2603.05255】CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

Link: https://arxiv.org/abs/2603.05255

Authors: Gong Chen, Chaokun Zhang, Tao Tang, Pengcheng Lv, Feng Li, Xin Xie

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Cooperative perception significantly, perception significantly enhances, significantly enhances scene, enhances scene understanding, integrating complementary information

Comments: Accepted by CVPR26

Click to view abstract

Abstract:Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.

27. 【2603.05230】Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Link: https://arxiv.org/abs/2603.05230

Authors: Serkan Ergun, Tobias Mitterer, Hubert Zangl

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: recycling requires robust, requires robust automation, robust automation solutions, automation solutions capable, sustainable textile recycling

Comments: 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

Click to view abstract

Abstract:The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multi-modal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

28. 【2603.05219】SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery

Link: https://arxiv.org/abs/2603.05219

Authors: Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Modern Earth observation, Earth observation relies, Modern Earth, Earth observation, detailed surface properties

Comments:

Click to view abstract

Abstract:Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting. Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
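
To make the "attention modulated with Gaussian distance weighting" idea concrete, here is a minimal single-head sketch. It assumes the modulation is multiplicative and followed by renormalisation, which the abstract does not specify. The function name, the `sigma` parameter, and the use of raw scores as a stand-in for land-cover similarity are all illustrative assumptions, not the authors' implementation.

```python
import math

def gaussian_weighted_attention(scores, distances, sigma=1.0):
    """Softmax over raw scores, scaled by a Gaussian of distance and renormalised.

    scores:    hypothetical similarity of each neighbouring pixel to the sensor pixel
    distances: distance of each neighbouring pixel from the sensor pixel
    """
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    gauss = [math.exp(-(d ** 2) / (2.0 * sigma ** 2)) for d in distances]
    unnormalised = [e * g for e, g in zip(exp_scores, gauss)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# With identical scores, the weights depend only on distance.
weights = gaussian_weighted_attention([0.0, 0.0, 0.0], [0.0, 1.0, 2.0], sigma=1.0)
print(weights)
```

Under these assumptions, pixels farther from the sensor contribute less even when their similarity score is the same, which is the intuition behind distance-aware weighting.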

29. 【2603.05202】Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2603.05202

Authors: Yingxue Su, Yiheng Zhong, Keying Zhu, Zimu Zhang, Zhuoru Zhang, Yifang Wang, Yuxin Zhang, Jingxin Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical image segmentation, computer-aided diagnosis, critical for computer-aided, Medical image, Semantic Anchor Constraints

Comments: 9 pages, 2 figures

Click to view abstract

Abstract:Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at this https URL.

30. 【2603.05184】Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule

Link: https://arxiv.org/abs/2603.05184

Authors: Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Patient Activity Recognition, quality of care, Patient Activity, data to improve, Activity Recognition

Comments:

Click to view abstract

Abstract:Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicit emergence patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: this https URL.

31. 【2603.05181】Mario: Multimodal Graph Reasoning with Large Language Models

Link: https://arxiv.org/abs/2603.05181

Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, large language models, advances in large, large language, opened new avenues

Comments: CVPR 2026

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at this https URL.

32. 【2603.05159】Generic Camera Calibration using Blurry Images

Link: https://arxiv.org/abs/2603.05159

Authors: Zezhun Shi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Generic camera, Camera calibration, Generic camera calibration, Camera, generic camera model

Comments:

Click to view abstract

Abstract:Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.

33. 【2603.05157】The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis

Link: https://arxiv.org/abs/2603.05157

Authors: Dishantkumar Sutariya, Eike Petersen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: chest X-ray, Deep learning models, Deep learning, racial shortcut learning, identify racial identity

Comments: Preprint accepted for publication at BVM 2026 ([this https URL](https://www.bvm-conf.org/))

Click to view abstract

Abstract:Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse (non-localized and distributed throughout the whole CXR recording), image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.

34. 【2603.05152】SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction

Link: https://arxiv.org/abs/2603.05152

Authors: Ningjing Fan, Yiqun Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: achieved remarkable progress, Gaussian splatting, recent years, view synthesis, achieved remarkable

Comments: Project page: [this https URL](https://gsflyer.github.io/SSR-GS/)

Click to view abstract

Abstract:In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.

35. 【2603.05147】Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

Link: https://arxiv.org/abs/2603.05147

Authors: Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: models predominantly focuses, established reasoning techniques, Current research, models predominantly, reasoning techniques

Comments:

Click to view abstract

Abstract:Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.

36. 【2603.05135】SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

Link: https://arxiv.org/abs/2603.05135

Authors: Wenqian Li, Pengfei Fang, Hui Xue

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Cross-Domain Few-Shot Learning, Few-Shot Learning, Cross-Domain Few-Shot, aims to transfer, transferability of models

Comments:

Click to view abstract

Abstract:Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima. To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial Style Perturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.

37. 【2603.05114】UniPAR: A Unified Framework for Pedestrian Attribute Recognition

Link: https://arxiv.org/abs/2603.05114

Authors: Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang, ChiaWei Chu, Yu Li

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: intelligent retail analytics, foundational computer vision, computer vision task, Pedestrian Attribute Recognition, including person retrieval

Comments:

Click to view abstract

Abstract:Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on this https URL.

38. 【2603.05110】BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity

Link: https://arxiv.org/abs/2603.05110

Authors: Iman Nematollahi, Jose Francisco Villena-Ossa, Alina Moter, Kiana Farhadyar, Gabriel Kalweit, Abhinav Valada, Toni Cathomen, Evelyn Ullrich, Maria Kalweit

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Machine learning models, Machine learning, dynamics hold promise, interaction dynamics hold, hold promise

Comments:

Click to view abstract

Abstract:Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.

39. 【2603.05105】Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search

Link: https://arxiv.org/abs/2603.05105

Authors: Zongfang Liu, Shengkun Tang, Zongliang Wu, Xin Yuan, Zhiqiang Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, remain computationally demanding, computationally demanding due, multi-step denoising process, large model sizes

Comments:

Click to view abstract

Abstract:Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning. Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.

40. 【2603.05095】GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

Link: https://arxiv.org/abs/2603.05095

Authors: Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: providing interpretable evidence, Temporal Forgery Localization, precisely identify manipulated, identify manipulated segments, Weakly Supervised TFL

Comments: 10 pages, 4 figures, accepted by CVPR 2026

Click to view abstract

Abstract:Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
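
The "gradient blockage" attributed to top-k aggregation can be illustrated with a toy example. The sketch below assumes one common form of top-k pooling (the video-level score is the mean of the k highest frame scores); the abstract does not say which variant existing WS-TFL methods use, and the function name is hypothetical.

```python
def topk_mean(frame_scores, k):
    """Return the mean of the k highest frame scores and its gradient per frame."""
    order = sorted(range(len(frame_scores)),
                   key=lambda i: frame_scores[i], reverse=True)
    selected = set(order[:k])
    score = sum(frame_scores[i] for i in selected) / k
    # Only the selected frames influence the score: gradient 1/k there, 0 elsewhere.
    grad = [1.0 / k if i in selected else 0.0 for i in range(len(frame_scores))]
    return score, grad

score, grad = topk_mean([0.1, 0.9, 0.2, 0.8, 0.3], k=2)
print(score, grad)
```

In this toy setting three of the five frames receive no learning signal from the video-level label, which gives a sense of why the hard selection step limits supervision.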

41. 【2603.05093】Axiomatic On-Manifold Shapley via Optimal Generative Flows

Link: https://arxiv.org/abs/2603.05093

Authors: Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: off-manifold artifacts due, post-hoc XAI, XAI but suffers, Shapley-based attribution, critical for post-hoc

Comments: 11 figures, 22 pages

Abstract:Shapley-based attribution is critical for post-hoc XAI but suffers from off-manifold artifacts due to heuristic baselines. While generative methods attempt to address this, they often introduce geometric inefficiency and discretization drift. We propose a formal theory of on-manifold Aumann-Shapley attributions driven by optimal generative flows. We prove a representation theorem establishing the gradient line integral as the unique functional satisfying efficiency and geometric axioms, notably reparameterization invariance. To resolve path ambiguity, we select the kinetic-energy-minimizing Wasserstein-2 geodesic transporting a prior to the data distribution. This yields a canonical attribution family that recovers classical Shapley for additive models and admits provable stability bounds against flow approximation errors. By reframing baseline selection as a variational problem, our method experimentally outperforms baselines, achieving strict manifold adherence via vanishing Flow Consistency Error and superior semantic alignment characterized by Structure-Aware Total Variation. Our code is on this https URL.
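
The gradient line integral mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' method: it assumes a straight-line path from a baseline to the input in place of the Wasserstein-2 geodesic flow, and the names `path_attributions`, `straight_line`, and `model` are hypothetical. It approximates phi_i as the integral of (df/dx_i) * (dgamma_i/dt) along the path and checks the efficiency property, i.e. that the attributions sum to f(x) - f(baseline).

```python
from typing import Callable, List

Vector = List[float]


def numerical_gradient(f: Callable[[Vector], float], x: Vector, eps: float = 1e-5) -> Vector:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def path_attributions(f: Callable[[Vector], float],
                      path: Callable[[float], Vector],
                      steps: int = 200) -> Vector:
    """Midpoint-rule approximation of the gradient line integral along `path`."""
    dim = len(path(0.0))
    phi = [0.0] * dim
    for k in range(steps):
        t0, t1 = k / steps, (k + 1) / steps
        start, end = path(t0), path(t1)
        grad = numerical_gradient(f, path((t0 + t1) / 2))
        for i in range(dim):
            # Gradient component times the path displacement on this segment.
            phi[i] += grad[i] * (end[i] - start[i])
    return phi


def straight_line(baseline: Vector, x: Vector) -> Callable[[float], Vector]:
    """Assumed path: linear interpolation (a stand-in for a learned flow)."""
    return lambda t: [b + t * (xi - b) for b, xi in zip(baseline, x)]


def model(x: Vector) -> float:
    """Toy model with one interaction term and one additive term."""
    return x[0] * x[1] + 3.0 * x[2]


baseline = [0.0, 0.0, 0.0]
x = [1.0, 2.0, -1.0]
phi = path_attributions(model, straight_line(baseline, x))
print(phi, sum(phi), model(x) - model(baseline))
```

For the purely additive term `3.0 * x[2]`, the attribution equals that term's own contribution, which is consistent with the abstract's remark that the family recovers classical Shapley values for additive models.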

42. 【2603.05081】Orthogonal Spatial-temporal Distributional Transfer for 4D Generation

Link: https://arxiv.org/abs/2603.05081

Authors: Wei Liu, Shengqiong Wu, Bobo Li, Haoyu Zhao, Hao Fei, Mong-Li Lee, Wynne Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasing research attention, garnered increasing research, AIGC era, content has garnered, garnered increasing

Comments: 9 pages, 6 figures, 3 tables, AAAI

Abstract:In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.

43. 【2603.05078】MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Link: https://arxiv.org/abs/2603.05078

Authors: Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: camera pose estimation, remains challenging due, corrupt camera pose, scenes remains challenging, pose estimation

Comments: Accepted by CVPR 2025. Project page: [this https URL](https://hellexf.github.io/MoRe/)

Abstract:Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.

44. 【2603.05075】UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Link: https://arxiv.org/abs/2603.05075

Authors: Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, Wei Ji, Lei Cui, Furu Wei, Hao Fei, Mong-Li Lee, Wynne Hsu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: comprehend arbitrarily combined, real-world multimodal applications, interleaved multimedia form, Multimodal Large Language, interleaved multimodal inputs

Comments: 70 pages, 63 figures, 30 tables, CVPR

Abstract:In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is this https URL.

45. 【2603.05071】MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration

Link: https://arxiv.org/abs/2603.05071

Authors: Nian Liu, Jin Gao, Shubo Lin, Yutong Kou, Sikui Zhang, Fudong Ge, Zhiqiang Pu, Liang Li, Gang Wang, Yizheng Wang, Weiming Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared small target, small target detection, low-contrast targets, challenging because tiny, dynamic backgrounds

Comments: 18 pages, 6 figures

Abstract:Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at this https URL.

46. 【2603.05058】A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset

Link: https://arxiv.org/abs/2603.05058

Authors: Francisco Vacalebri-Lloret (1), Lucas Banchero (1), Jose J. Lopez (1), Jose M. Mossi (1) ((1) Universitat Politècnica de València, Spain)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: detecting blue lights, European emergency vehicles, images of European, developed using ABLDataset, European emergency

Comments: 16 pages, 17 figures. Submitted to IEEE Transactions on Intelligent Vehicles

Abstract:This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.

47. 【2603.05053】CLIP-driven Zero-shot Learning with Ambiguous Labels

Link: https://arxiv.org/abs/2603.05053

Authors: Jinfu Fan, Jiangnan Li, Xiaowen Yan, Xiaohui Zhong, Wenpeng Lu, Linqing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognize unseen classes, existing methods assume, methods assume accurate, assume accurate class, accurate class labels

Comments: Accepted by ICASSP 2026 (IEEE International Conference on Acoustics, Speech, and Signal Processing)

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
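
One way to picture "assigning weights to candidate labels based on their relevance to the instance" is a softmax over instance-label similarities, restricted to the candidate set. The sketch below is an illustrative assumption only: the abstract does not specify the weighting rule, and the function names, the temperature value, and the toy 2-D embeddings (standing in for CLIP features) are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_weights(instance: Sequence[float],
                      label_embeddings: Dict[str, Sequence[float]],
                      candidates: List[str],
                      temperature: float = 0.1) -> Dict[str, float]:
    """Softmax over similarities, computed only for the candidate labels."""
    logits = {c: cosine(instance, label_embeddings[c]) / temperature for c in candidates}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy 2-D embeddings standing in for CLIP text features (hypothetical values).
label_embeddings = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
instance = [0.9, 0.1]            # stand-in for a CLIP image feature
candidates = ["cat", "car"]      # ambiguous (partial) label set for this instance

weights = candidate_weights(instance, label_embeddings, candidates)
print(weights)
```

Labels outside the candidate set receive no weight at all, which reflects the partial-label assumption that the true label lies among the candidates.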

48. 【2603.05042】CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

Link: https://arxiv.org/abs/2603.05042

Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: multi-sensor physical agents, attracted increasing attention, object detection, physical agents, autonomous vehicles

Comments: Accepted to CVPR 2026 main track

Abstract:Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.

49. 【2603.05041】Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation

Link: https://arxiv.org/abs/2603.05041

Authors: Thomas Pinetz, Veit Hucke, Hrvoje Bogunovic

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Primary health care, low-cost imaging devices, health care frequently, care frequently relies, Primary health

Comments: Accepted at MIDL 2026

Abstract:Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
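
The two quantities the abstract describes, an entropy loss averaged over reconstruction timesteps and an uncertainty estimate from the variation among timestep-wise segmentations, can be sketched in plain Python. This is a rough illustration under stated assumptions: predictions are given as nested lists indexed `[timestep][pixel][class]`, and measuring variation as the fraction of timesteps that disagree with the majority label is a choice made here, not something the abstract specifies.

```python
import math
from collections import Counter
from typing import List

Probs = List[float]


def entropy(probs: Probs) -> float:
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def averaged_entropy(per_step_probs: List[List[Probs]]) -> float:
    """Mean entropy over all timesteps and all pixels."""
    values = [entropy(pixel) for step in per_step_probs for pixel in step]
    return sum(values) / len(values)


def disagreement(per_step_probs: List[List[Probs]]) -> List[float]:
    """Per pixel: fraction of timesteps whose argmax differs from the majority label."""
    n_steps = len(per_step_probs)
    n_pixels = len(per_step_probs[0])
    scores = []
    for p in range(n_pixels):
        labels = [max(range(len(step[p])), key=step[p].__getitem__)
                  for step in per_step_probs]
        majority_count = Counter(labels).most_common(1)[0][1]
        scores.append(1.0 - majority_count / n_steps)
    return scores


# Two timesteps, two pixels, two classes (toy values).
preds = [
    [[0.9, 0.1], [0.6, 0.4]],   # segmentation from an early reconstruction
    [[0.8, 0.2], [0.3, 0.7]],   # segmentation from a later reconstruction
]
loss = averaged_entropy(preds)
uncertainty = disagreement(preds)
print(loss, uncertainty)
```

In this toy input the first pixel keeps the same label at both timesteps and the second pixel flips, so only the second pixel is flagged as uncertain.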

50. 【2603.05037】Generalizable Multiscale Segmentation of Heterogeneous Map Collections

Link: https://arxiv.org/abs/2603.05037

Authors: Remi Petitpierre

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse in style, highly diverse, map, single-sheet documents, Historical

Comments: 30 pages, 15 figures

Abstract:Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives to historical geographic studies.

51. 【2603.05012】Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

Link: https://arxiv.org/abs/2603.05012

Authors: Yulong Shi, Shijie Li, Ziyi Li, Lin Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Source Free Unsupervised, Free Unsupervised Domain, Source Free, Free Unsupervised, deploying deep learning

Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)

Abstract:Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modality, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at this https URL.

52. 【2603.05010】How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Link: https://arxiv.org/abs/2603.05010

Authors: Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: impressive perceptual realism, achieved impressive perceptual, Generative Image Restoration, achieved impressive, practical capabilities

Abstract:Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

53. 【2603.04999】Physics-consistent deep learning for blind aberration recovery in mobile optics

Link: https://arxiv.org/abs/2603.04999

Authors: Kartik Jhawar, Tamo Sancho Miguel Tandoc, Khoo Jun Xuan, Wang Lipo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lens-specific optical aberrations, limited by complex, deep learning, lens-specific optical, optical aberrations

Comments: 4 pages, 3 figures

Abstract:Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these "black-box" models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
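
For readers unfamiliar with the optics, the link between Zernike coefficients, the wavefront, and the point spread function follows the standard Fourier-optics relations below. This is textbook background rather than the paper's exact formulation; the symbols $z_j$, $Z_j$, $A$, $W$, and $P$ are notation chosen here for illustration.

```latex
\begin{aligned}
W(\rho,\theta) &= \sum_{j} z_j \, Z_j(\rho,\theta)
  && \text{(wavefront error, in waves, as a Zernike expansion)} \\
P(\rho,\theta) &= A(\rho,\theta)\,\exp\!\bigl(i\,2\pi\,W(\rho,\theta)\bigr)
  && \text{(complex pupil function with aperture } A\text{)} \\
\mathrm{PSF}(u,v) &\propto \bigl|\,\mathcal{F}\{P\}(u,v)\,\bigr|^{2}
  && \text{(incoherent point spread function)}
\end{aligned}
```

Because every step is differentiable in the coefficients $z_j$, errors measured on the wavefront or on the PSF can in principle be backpropagated to a coefficient regressor.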

54. 【2603.04993】MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration

Link: https://arxiv.org/abs/2603.04993

Authors: Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Yu Feng, Hao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single image, complete and realistic, clothed human reconstruction, human reconstruction aims, clothed human

Abstract:Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by the unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts features of each body region and lets them interact to obtain geometry information, and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.

55. 【2603.04989】TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Link: https://arxiv.org/abs/2603.04989

Authors: Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring high precision, long-term motion reasoning, computer vision, requiring high, fundamental yet challenging

Abstract:Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: this http URL

56. 【2603.04980】A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

Link: https://arxiv.org/abs/2603.04980

Authors: Jie Zhu, Hanghang Ma, Jia Wang, Yayong Guan, Yanbing Zeng, Lishuai Gao, Junqiang Wu, Jie Hu, Leye Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leverages next-token prediction, simple autoregressive baseline, unify multi-modal understanding, baseline that leverages, leverages next-token

Comments: Technical report. This work serves as a straightforward autoregressive baseline for unifying understanding, generation, and editing

Abstract:In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at this https URL.

57. 【2603.04977】Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

Link: https://arxiv.org/abs/2603.04977

Authors: Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long-range temporal dependencies, dense visual redundancy, accumulate semantic drift, Long video understanding, visual redundancy

Comments: Accepted at CVPR 2026

Abstract:Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: this https URL.

58. 【2603.04976】3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Link: https://arxiv.org/abs/2603.04976

Authors: Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, understanding remains under-explored, Reinforcement Learning, Multi-modal Large Language, scene understanding remains

Comments: Project page: [this https URL](https://3d-rft.github.io/)

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
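
To make "reward functions directly from metrics like 3D IoU" and the group-relative part of GRPO concrete, here is a minimal sketch. It rests on several assumptions not stated in the abstract: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, the reward is the raw IoU, and the advantage is each reward standardised within its sampling group using the population standard deviation. The paper's actual reward shaping and box parameterisation may differ.

```python
import statistics
from typing import List, Sequence

Box = Sequence[float]  # (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned (assumption)


def volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    return intersection / (volume(box_a) + volume(box_b) - intersection)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardise rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, three sampled box predictions scored against a ground-truth box.
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predictions = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),   # exact match
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),   # shifted by half the box width
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),   # disjoint
]
rewards = [iou_3d(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the advantage is relative to the group mean, a prediction is reinforced only when it scores above its siblings for the same prompt, with no separate value network involved.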

59. 【2603.04975】BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

Link: https://arxiv.org/abs/2603.04975

Authors: Zishu Yao,Xiang-Xiang Su,Shengning Zhou,Guang-Yong Chen,Guodong Fan,Xing Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, show great promise, Low-light Image Enhancement, promise for Low-light, Low-light Image

Comments:

Abstract:Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage (which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective), we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the real-world noise dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR* and 0.047 in SSIM, respectively. The code will be publicly available at this https URL.

60. 【2603.04958】Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression

Link: https://arxiv.org/abs/2603.04958

Authors: Toby Chong,Ryota Nakajima

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: perspective distortion effect, distortion effect commonly, close-up facial images, Morphable Model, effectively captures

Comments: WACV 2026, [this https URL](https://zukunfcs.github.io/RevisitingAnOldPerspective/)

Abstract:We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.


61. 【2603.04957】VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Link: https://arxiv.org/abs/2603.04957

Authors: Jiaxin Fan,Wenpo Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: existing approaches rely, Large Multimodal Models, Large Multimodal, generate detailed image, achieved strong performance

Comments:

62. 【2603.04950】Location-Aware Pretraining for Medical Difference Visual Question Answering

Link: https://arxiv.org/abs/2603.04950

Authors: Denis Musinguzi,Caren Han,Prasenjit Mitra

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Unlike conventional single-image, comparative diagnostic workflow, Unlike conventional, frameworks process multiple, conventional single-image models

Comments: 11 pages

Abstract:Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.

63. 【2603.04949】TimeWarp: Evaluating Web Agents by Revisiting the Past

Link: https://arxiv.org/abs/2603.04949

Authors: Md Farhan Ishmam,Kenneth Marino

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: current benchmarks raises, today agents perform, raises the question, web, current benchmarks

Comments:

64. 【2603.04947】Adaptive Prototype-based Interpretable Grading of Prostate Cancer

Link: https://arxiv.org/abs/2603.04947

Authors: Riddhasree Bhattacharyya,Pallabi Dutta,Sushmita Mitra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frequently diagnosed malignancy, malignancy in men, Prostate cancer, frequently diagnosed, diagnosed malignancy

Comments:

Abstract:Prostate cancer is one of the most frequently diagnosed malignancies in men, and the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.

65. 【2603.04938】Person Detection and Tracking from an Overhead Crane LiDAR

Link: https://arxiv.org/abs/2603.04938

Authors: Nilusha Jayawickrama,Henrik Toikka,Risto Ojala

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: industrial indoor workspace, paper investigates person, paper investigates, industrial indoor, indoor workspace

Comments: 8 pages, 7 figures, 4 tables. Submitted to Ubiquitous Robots (UR) 2026. Code: [this https URL](https://github.com/nilushacj/O-LiPeDeT-Overhead-LiDAR-Person-Detection-and-Tracking)

Abstract:This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data are scarce. Hence, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.
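
The distance-sliced evaluation mentioned in the abstract can be illustrated with a small sketch: restrict detections and ground-truth persons to a horizontal radius around the sensor, then compute AP on that subset. This is a simplified illustration under assumptions, not the authors' protocol: matching detections to ground truth is assumed to have happened upstream (the `is_true_positive` flag), AP is the non-interpolated variant, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumed record: (x, y, score, is_true_positive); x and y are horizontal
# offsets in metres from the point directly below the sensor.
Detection = Tuple[float, float, float, bool]


def average_precision(scored_hits: List[Tuple[float, bool]], num_gt: int) -> float:
    """Non-interpolated AP: sum of precision at each true-positive rank / #GT."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def sliced_ap(
    detections: Sequence[Detection],
    gt_positions: Sequence[Tuple[float, float]],
    radius_m: float,
) -> float:
    """AP restricted to objects within radius_m of the sensor axis."""
    kept = [
        (score, hit)
        for (x, y, score, hit) in detections
        if math.hypot(x, y) <= radius_m
    ]
    num_gt = sum(1 for (x, y) in gt_positions if math.hypot(x, y) <= radius_m)
    return average_precision(kept, num_gt)
```

Evaluating `sliced_ap` at several radii (for example 1.0 m and 5.0 m) gives the kind of per-distance profile the abstract uses to describe the operating envelope.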

66. 【2603.04913】Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object

Link: https://arxiv.org/abs/2603.04913

Authors: Chanmi Lee,Minsung Yoon,Woojae Kim,Sebin Lee,Sung-eui Yoon

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neural network-based visuomotor, network-based visuomotor policies, visuomotor policies enable, perform manipulation tasks, Neural network-based

Comments: 8 pages, 10 figures, Accepted to ICRA 2026. Project page: [this https URL](https://chan-mi-lee.github.io/3DAdvObj/)

Abstract:Neural network-based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera-object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.

67. 【2603.04908】AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

Link: https://arxiv.org/abs/2603.04908

Authors: Li'an Zhong,Ziqiang He,Jibin Zheng,Jin Li,Z. Jane Wang,Xiangui Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, current Large Vision-Language, Vision-Language Models, current Large, Large Vision-Language

Comments:

Abstract:Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
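
As a rough illustration of what amplifying attention to generated text can mean for a single attention row, the sketch below scales the weights at generated-token positions by `1 + alpha` and renormalizes. This is an assumption-laden toy: the function name, the post-softmax intervention point, and the single scalar `alpha` are all hypothetical, and the abstract's layer-wise threshold and per-head magnitudes are not modeled.

```python
from typing import List, Sequence, Set


def amplify_generated_text_attention(
    weights: Sequence[float],
    generated_positions: Set[int],
    alpha: float,
) -> List[float]:
    """Scale attention at generated-text positions by (1 + alpha), renormalize.

    `weights` is one post-softmax attention row (non-negative, sums to 1).
    """
    scaled = [
        w * (1.0 + alpha) if i in generated_positions else w
        for i, w in enumerate(weights)
    ]
    total = sum(scaled)
    if total == 0.0:
        return list(weights)  # nothing to renormalize
    return [w / total for w in scaled]
```

With `alpha = 0` the row is unchanged; larger values shift probability mass toward previously generated tokens at the expense of the other positions.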

68. 【2603.04899】FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation

Link: https://arxiv.org/abs/2603.04899

Authors: Ganggui Ding,Hao Chen,Xiaogang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large pre-trained video, intrinsic generative priors, limiting detail preservation, diffusion models excel, Large pre-trained

Comments: ICASSP2026

Abstract:Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high-fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting $4\times$ and $8\times$ interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at $2560\times 1440$ resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
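
The frame-rate arithmetic in the abstract can be checked with a short sketch: an interpolation factor of $k$ multiplies the frame rate by $k$ and inserts $k - 1$ evenly spaced frames between each pair of source frames. The helper names are hypothetical, and exact fractions are used only to avoid floating-point noise in the timestamps.

```python
from fractions import Fraction
from typing import List


def interpolated_fps(source_fps: int, factor: int) -> int:
    """Output frame rate after factor-times temporal interpolation."""
    return source_fps * factor


def intermediate_timestamps(
    t_start: Fraction, t_end: Fraction, factor: int
) -> List[Fraction]:
    """Timestamps of the factor - 1 frames inserted between two source frames."""
    step = (t_end - t_start) / factor
    return [t_start + k * step for k in range(1, factor)]
```

For a 30 FPS source, consecutive frames are 1/30 s apart, so a factor of 4 inserts three frames at 1/120 s spacing (120 FPS) and a factor of 8 inserts seven frames at 1/240 s spacing (240 FPS).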

69. 【2603.04892】Locality-Attending Vision Transformer

Link: https://arxiv.org/abs/2603.04892

Authors: Sina Hajimiri,Farzad Beizaee,Fereshteh Shakeri,Christian Desrosiers,Ismail Ben Ayed,Jose Dolz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: capture long-range dependencies, demonstrated remarkable success, long-range dependencies, demonstrated remarkable, remarkable success

Comments: Accepted to ICLR 2026

Abstract:Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at this https URL.
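
One common way to bias attention toward neighboring patches with a Gaussian kernel is to add the log of the kernel to the attention logits before the softmax, which is equivalent to multiplying the attention weights by the kernel and renormalizing. The sketch below shows that construction on a patch grid. It is an assumed formulation for illustration only: the abstract does not give the exact parameterization, and `sigma` is a fixed number here, whereas the abstract describes the kernel as learnable.

```python
import math
from typing import List, Tuple


def patch_coords(grid_h: int, grid_w: int) -> List[Tuple[int, int]]:
    """Row-major (row, col) coordinates of the patches in a grid."""
    return [(r, c) for r in range(grid_h) for c in range(grid_w)]


def gaussian_locality_bias(grid_h: int, grid_w: int, sigma: float) -> List[List[float]]:
    """Additive bias -d^2 / (2 sigma^2) for every (query, key) patch pair."""
    coords = patch_coords(grid_h, grid_w)
    return [
        [-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(logits: List[List[float]], bias: List[List[float]]) -> List[List[float]]:
    """Attention weights after adding the locality bias to the logits."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]
```

A small `sigma` concentrates each token's attention on nearby patches, while a very large `sigma` makes the bias negligible and recovers ordinary global attention, which matches the abstract's stated goal of favoring locality without losing global context.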

70. 【2603.04890】FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation

Link: https://arxiv.org/abs/2603.04890

Authors: Min Tan,Junchao Ma,Yinfu Feng,Jiajun Ding,Wenwen Pan,Tingting Han,Qian Zheng,Zhenzhong Kuang,Zhou Yu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Federated Learning, Multimodal Federated, complementary cross-modal information, leverages complementary cross-modal, collaboratively train models

Comments: Accepted by CVPR 2026

Abstract:Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.

71. 【2603.04887】Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

Link: https://arxiv.org/abs/2603.04887

Authors: Hong Liu,Dong Wei,Qian Dai,Xian Wu,Yefeng Zheng,Liansheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, considered intramodal heterogeneity, existing federated learning, multimodal imaging applications, limiting their applicability

Comments: Medical Image Analysis 2025. arXiv admin note: substantial text overlap with [arXiv:2403.11803](https://arxiv.org/abs/2403.11803)

Abstract:Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs -- using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.
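
For reference, the scaled dot-product cross-attention named in the abstract is usually written in the standard form below. Reading $Q$ as projections of a client's missing-modal representations and $K$, $V$ as projections of the global full-modal anchors is an assumption based on the abstract's wording, not a detail confirmed by the source; $d_k$ denotes the key dimension.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$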

72. 【2603.04882】DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

Link: https://arxiv.org/abs/2603.04882

Authors: Xiaodong Zhu,Suting Wang,Yuanming Zheng,Junqi Yang,Yangxu Liao,Yuhong Yang,Weiping Tu,Zhongyuan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: Temporal Forgery Localization, offering strong interpretability, precisely identify manipulated, identify manipulated segments, Forgery Localization

Comments: 9 pages, 4 figures, accepted by AAAI 2026

Abstract:Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.

73. 【2603.04878】Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Link: https://arxiv.org/abs/2603.04878

Authors: Hong Liu,Dong Wei,Qiong Peng,Yawen Huang,Xian Wu,Yefeng Zheng,Liansheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Computed Tomography Report, Tomography Report Generation, Computed Tomography, facilitating patient care, X-ray report generation

Comments: Accepted to IPMI 2025

Abstract:Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.

74. 【2603.04874】Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics

Link: https://arxiv.org/abs/2603.04874

Authors: Jerrin Bright,Michelle Lu,John Zelek

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: upcoming pitch, pitcher body reveal, Abstract, pitcher body, pitch

Comments: Submitted to CVPRW'26

Abstract:How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, ground-truth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs. two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.

75. 【2603.04870】Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

Link: https://arxiv.org/abs/2603.04870

Authors: Jaekyun Ko,Dongjin Kim,Soomin Lee,Guanghui Wang,Tae Hyun Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sRGB image space, space is challenging, challenging due, realistic noisy images, Denoising

Comments: CVPR 2026

Abstract:Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

76. 【2603.04869】SURE: Semi-dense Uncertainty-REfined Feature Matching

Link: https://arxiv.org/abs/2603.04869

Authors: Sicheng Li,Zaiwang Gu,Jie Zhang,Qing Guo,Xudong Jiang,Jun Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Establishing reliable image, robotic vision problems, Establishing reliable, vision problems, reliable image correspondences

Comments: Accepted by ICRA 2026

Abstract:Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at this https URL.

77. 【2603.04864】Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video

Link: https://arxiv.org/abs/2603.04864

Authors: Jerrin Bright,Justin Mende,John Zelek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise biomechanical signals, stadium-installed multi-camera systems, biomechanical signals, stadium-installed multi-camera, depends on precise

Comments: Submitted to CVPRW'26

Abstract:Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE $< 1^{\circ}$). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.

78. 【2603.04852】On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

Link: https://arxiv.org/abs/2603.04852

Authors: Junbo Zhao,Ting Zhang,Can Li,Wei He,Jingdong Wang,Hua Huang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-step theorem prediction, Multi-step theorem, central challenge, challenge in automated, theorem prediction

Comments:

Click to view abstract

Abstract:Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
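
The pruning idea described in this abstract can be illustrated with a small sketch. It assumes a deliberately simple rule that is not taken from the paper: an edge a → b is recorded whenever theorem a immediately precedes theorem b in a historical trace, and a candidate is kept only if the graph allows it after the most recently applied theorem. The function names and theorem names are hypothetical.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Record which theorems start a trace and which theorem follows which."""
    starters = set()
    successors = defaultdict(set)
    for trace in traces:
        if trace:
            starters.add(trace[0])
        for earlier, later in zip(trace, trace[1:]):
            successors[earlier].add(later)
    return starters, successors


def prune_candidates(starters, successors, applied, candidates):
    """Keep only candidates that the graph allows after the steps applied so far."""
    allowed = starters if not applied else successors.get(applied[-1], set())
    return [c for c in candidates if c in allowed]


# Hypothetical solution traces; the theorem names are illustrative only.
traces = [
    ["parallel_lines", "alternate_angles", "triangle_sum"],
    ["parallel_lines", "corresponding_angles", "triangle_sum"],
]
starters, successors = build_precedence_graph(traces)
candidates = ["triangle_sum", "alternate_angles", "corresponding_angles"]
print(prune_candidates(starters, successors, ["parallel_lines"], candidates))
```

In this toy setting the planner would only be offered the two angle theorems after `parallel_lines`, which is the sense in which a precedence graph shrinks the search space before the LLM chooses a step.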

79. 【2603.04847】GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction

Link: https://arxiv.org/abs/2603.04847

Authors: Tianyu Xiong,Rui Li,Linjie Li,Jiaqi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: independent optimization objectives, structure from motion, Gaussian Splatting training, view synthesis, NVS

Comments:

Click to view abstract

Abstract:Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.

80. 【2603.04846】Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

Link: https://arxiv.org/abs/2603.04846

Authors: Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-Modal Large Language, advanced downstream applications, Large Language Models, significantly advanced downstream, Multi-Modal Large

Comments: Accepted by CVPR2026

Click to view abstract

Abstract:The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. Existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at this https URL.

81. 【2603.04839】Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Link: https://arxiv.org/abs/2603.04839

Authors: Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision-language pre-training, critical concern, rapid advancement, advancement and widespread, widespread application

Comments: Accepted by CVPR2026

Click to view abstract

Abstract:With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. SADCA achieves this by establishing a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at this https URL.

82. 【2603.04825】Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

Link: https://arxiv.org/abs/2603.04825

Authors: Rui Zhao,Bin Shi,Kai Sun,Bo Dong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Partial label learning, supervised classification task, prominent weakly supervised, weakly supervised classification, Partial label

Comments: Accepted to CVPR2026

Click to view abstract

Abstract:Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at this https URL.

83. 【2603.04817】Revisiting Shape from Polarization in the Era of Vision Foundation Models

Link: https://arxiv.org/abs/2603.04817

Authors: Chenhao Li,Taishi Ono,Takeshi Uemori,Yusuke Moriuchi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surface normal estimation, single-shot object-level surface, object-level surface normal, RGB-only vision foundation, vision foundation models

Comments:

Click to view abstract

Abstract:We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.

84. 【2603.04811】Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation

Link: https://arxiv.org/abs/2603.04811

Authors: SangHyuk Kim,Daniel Haehn,Sumientra Rampersad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: explicitly leverages categorical, leverages categorical scanner, categorical scanner metadata, MRI sequence, present Meta-D

Comments: 9 pages, 2 figures, 3 tables

Click to view abstract

Abstract:We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.

85. 【2603.04803】Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

Link: https://arxiv.org/abs/2603.04803

Authors: Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Ruochen Cui,Xilin Zhao,Qingming Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Contrastive Language-Image Pre-training, Detail Perceptual Ability, limited understanding capacity, Language-Image Pre-training, Discriminative Ability

Comments:

Click to view abstract

Abstract:The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at this https URL.

86. 【2603.04800】MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Link: https://arxiv.org/abs/2603.04800

Authors: Lulu Hu,Wenhu Xiao,Xin Chen,Xinhua Xu,Bowen Xu,Kun Li,Yongliang Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances. However, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: this https URL.
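
As background for the abstract above, the sketch below illustrates the SmoothQuant-style rescaling it uses as a case study: activation channel j is divided by a factor s_j = max|X_j|^α / max|W_j|^(1−α) and the matching weight row is multiplied by s_j, so the layer output stays the same. The toy numbers, function names, and the choice α = 0.5 are assumptions for illustration; the sketch does not implement MAS (which learns its factors) or CMC.

```python
def smoothing_factors(acts, weights, alpha=0.5):
    """SmoothQuant-style per-channel factors: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    factors = []
    for j, weight_row in enumerate(weights):
        act_max = max(abs(row[j]) for row in acts)
        weight_max = max(abs(w) for w in weight_row)
        factors.append(act_max ** alpha / weight_max ** (1 - alpha))
    return factors


def apply_smoothing(acts, weights, factors):
    """Divide activation channels by the factors and scale the matching weight rows."""
    smoothed_acts = [[x / s for x, s in zip(row, factors)] for row in acts]
    smoothed_weights = [[w * s for w in row] for row, s in zip(weights, factors)]
    return smoothed_acts, smoothed_weights


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy values (hypothetical): rows are tokens, columns are input channels.
text_acts = [[0.5, -8.0], [1.0, 4.0]]
vision_acts = [[20.0, 0.1], [-10.0, 0.2]]
weights = [[0.2, -0.4], [0.5, 0.25]]  # shape: input channels x output channels

text_factors = smoothing_factors(text_acts, weights)
vision_factors = smoothing_factors(vision_acts, weights)
print("text factors:", text_factors)
print("vision factors:", vision_factors)

# Computational invariance: the layer output is unchanged by the rescaling.
smoothed_acts, smoothed_weights = apply_smoothing(text_acts, weights, text_factors)
reference = matmul(text_acts, weights)
smoothed = matmul(smoothed_acts, smoothed_weights)
print(reference, smoothed)
```

In this toy example the two modalities yield clearly different factors for the same weight matrix, which is one plausible reading of why a single shared factor is a poor fit for both. Rescaling the weights with modality-specific factors would, in turn, produce a different weight matrix per modality, which appears to be the problem the abstract assigns to CMC.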

87. 【2603.04796】Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper

Link: https://arxiv.org/abs/2603.04796

Authors: Kiranmayee Janardhan,Vinay Martin DSa Prabhu,T. Christy Bobby

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: improving patient outcomes, precise treatment planning, aiding in precise, patient outcomes, improving patient

Comments: 22 pages, 4 Figures

Click to view abstract

Abstract:Segmentation is crucial for brain gliomas as it delineates the glioma's extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma's size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error-free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques applied after magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.

88. 【2603.04795】LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation

Link: https://arxiv.org/abs/2603.04795

Authors: Anugunj Naman,Ayushman Singh,Gaibo Zhang,Yaguang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Medical image analysis, image analysis relies, Medical image, controllable synthesis, image analysis

Comments:

Click to view abstract

Abstract:Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves a 20% FID improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches a 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.

89. 【2603.04793】RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery

Link: https://arxiv.org/abs/2603.04793

Authors: Huiran Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: non-adaptive receptive field, receptive field utilization, remote sensing imagery, inadequate long-range multi-scale, Rotated object detection

Comments:

Click to view abstract

Abstract:Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.

90. 【2603.04775】Privacy-Aware Camera 2.0 Technical Report

Link: https://arxiv.org/abs/2603.04775

Authors: Huan Song,Shuyu Tian,Ting Long,Jiang Liu,Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: profound privacy-security paradox, intelligent sensing technologies, highly sensitive environments, surveillance systems face, visual surveillance systems

Comments:

Click to view abstract

91. 【2603.04771】MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement

Link: https://arxiv.org/abs/2603.04771

Authors: Linda Wei,Chang Liu,Wenran Zhang,Yuxuan Hu,Ruiyang Li,Feng Qi,Changyao Tian,Ke Wang,Yuanyuan Wang,Shaoting Zhang,Dimitris Metaxas,Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: dental crown design, common treatment modalities, Dental crown restoration, personalized dental crown, Dental crown

Comments:

Click to view abstract

Abstract:Dental crown restoration is one of the most common treatment modalities for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the manual clinical workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced CrownSegger, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint. The margin was also used as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.

92. 【2603.04770】DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction

Link: https://arxiv.org/abs/2603.04770

Authors: Shiyu Zhang,Zhicong Wu,Huangxuan Zhao,Zhentao Liu,Lei Chen,Yong Luo,Lefei Zhang,Zhiming Cui,Ziwen Ke,Bo Du

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Digital subtraction angiography, key imaging technique, Digital subtraction, subtraction angiography, cerebrovascular diseases

Comments: 11 pages, 3 figures, 3 tables

Click to view abstract

Abstract:Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in Gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels. Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.

93. 【2603.04766】Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition

Link: https://arxiv.org/abs/2603.04766

Authors: Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: Differential Selection Strategy, Existing manual labeling, Global Anti-Monotonic Differential, Anti-Monotonic Differential Selection, manual labeling

Comments: 15 pages, 8 figures, 7 tables

Click to view abstract

Abstract:Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance. The source code is available on GitHub (this https URL).

94. 【2603.04763】Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Link: https://arxiv.org/abs/2603.04763

Authors: Alexandru Florea,Shansong Wang,Mingzhe Hu,Qiang Li,Zach Eidex,Luke del Balzo,Mojtaba Safari,Xiaofeng Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: task-specific artificial intelligence, raises fundamental questions, diagnosis demands synthesis, ambiguous patient narratives, general-purpose foundation models

Comments:

Click to view abstract

Abstract:The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.

95. 【2603.04745】Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

Link: https://arxiv.org/abs/2603.04745

Authors: Yang Zou,Jun Ma,Zhidong Jiao,Xingyuan Li,Zhiying Jiang,Jinyuan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rarely addressed task, Infrared image super-resolution, addressed task, practically significant, significant yet rarely

Comments: This paper was accepted by CVPR 2026

Click to view abstract

Abstract:Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. The dataset and code are available at: this https URL.
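
The idea of supervising relative brightness order rather than absolute values can be illustrated with a pairwise ranking penalty. The sketch below is an assumed formulation, not the paper's Thermal Order Consistency Loss: for every pixel pair whose reference intensities differ, it applies a hinge penalty when the prediction reverses that order. The function name and the `margin` parameter are hypothetical.

```python
from itertools import combinations


def order_consistency_loss(pred, ref, margin=0.0):
    """Mean hinge penalty over pixel pairs whose predicted order contradicts the reference."""
    total, count = 0.0, 0
    for i, j in combinations(range(len(ref)), 2):
        if ref[i] == ref[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if ref[i] > ref[j] else -1.0
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
        count += 1
    return total / count if count else 0.0


reference = [0.1, 0.5, 0.9]   # flattened toy patch of reference intensities
shifted = [0.3, 0.6, 1.2]     # different absolute values, same brightness order
reversed_order = [0.9, 0.5, 0.1]

print(order_consistency_loss(shifted, reference))
print(order_consistency_loss(reversed_order, reference))
```

A prediction that is globally brighter but keeps the same ordering incurs no penalty here, while a reversed ordering does, which is the behaviour one would want under thermal drift. Evaluating all pairs costs O(n²), so a practical variant would presumably sample pairs.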

96. 【2603.04733】FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

Link: https://arxiv.org/abs/2603.04733

Authors: Xingyu Wang,Tao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling deep learning, deep learning models, data distribution shifts, handle real-world data, real-world data distribution

Comments: Accepted to CVPR 2026

Abstract:Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
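
The idea of estimating a gradient from forward passes alone, with a perturbation scale that shrinks over time, can be illustrated on a toy problem. The sketch below is not FOZO's implementation: the quadratic `toy_loss`, the two-point estimator, the `eps0 / sqrt(t)` decay schedule, and every name in it are assumptions chosen for illustration.

```python
import random


def toy_loss(prompt):
    """Hypothetical stand-in for a forward pass: squared distance to a target."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(prompt, target))


def zo_gradient(loss_fn, params, eps, rng):
    """Two-point zeroth-order gradient estimate using only forward evaluations."""
    direction = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * d for p, d in zip(params, direction)]
    minus = [p - eps * d for p, d in zip(params, direction)]
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [scale * d for d in direction]


def adapt(loss_fn, params, steps=300, lr=0.05, eps0=0.1, seed=0):
    """Run forward-only updates with a perturbation scale that decays over time."""
    rng = random.Random(seed)
    params = list(params)
    for t in range(1, steps + 1):
        eps = eps0 / (t ** 0.5)  # assumed decay schedule, not taken from the paper
        grad = zo_gradient(loss_fn, params, eps, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


initial = [0.0, 0.0, 0.0]
adapted = adapt(toy_loss, initial)
```

No backward pass is involved: each step costs two evaluations of `loss_fn`. On this quadratic the finite difference is exact along the sampled direction, so the decay has little effect here; it matters for objectives with higher-order curvature.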

97. 【2603.04727】Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Link: https://arxiv.org/abs/2603.04727

Authors: Shanle Yao,Armin Danesh Pazho,Narges Rashvand,Hamed Tabkhi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, demonstrated impressive general, impressive general competence, Video Anomaly Detection, Multimodal large

Comments:

Abstract:Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
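
To see why a conservative bias gives high precision but a collapsed recall, it helps to compute the three metrics on a small example. The labels below are made up for illustration and are not taken from ShanghaiTech or CHAD.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1, treating label 1 (anomalous) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical clip labels: 1 = anomalous, 0 = normal.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A conservative model flags a single clip, and that clip is truly anomalous.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here precision is 1.0 because the only flagged clip is correct, while recall is 0.2 because four of the five anomalies are missed, which pulls F1 down to about 0.33.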

98. 【2603.04720】A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2603.04720

Authors: Sai Shi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: learn complex patterns, achieved strong performance, image classification tasks, classification tasks due, high-dimensional data

Comments: 18 pages, 5 figures

Abstract:Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.

99. 【2603.04676】Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Link: https://arxiv.org/abs/2603.04676

Authors: Chenjun Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multi-image reasoning remains, remains a significant, significant challenge, challenge for vision-language, reasoning remains

Comments: 9 pages, 5 figures, 3 tables

Abstract:Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).

100. 【2603.04673】sFRC for assessing hallucinations in medical image restoration

Link: https://arxiv.org/abs/2603.04673

Authors: Prabhat Kc,Rongping Zeng,Nirmal Soni,Aldo Badano

Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph); Machine Learning (stat.ML)

Keywords: Deep learning, images from sparse-view, Fourier Ring Correlation, explored to restore, restore images

Comments: 16 pages; 14 figures; 1 Supplemental document. TechRxiv Preprints, 2025

Abstract:Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.

101. 【2603.04670】Using Vision + Language Models to Predict Item Difficulty

Link: https://arxiv.org/abs/2603.04670

Authors: Samin Khan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: data visualization literacy, large language models, visualization literacy test, project investigates, investigates the capabilities

Comments:

102. 【2603.04638】Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI

Link: https://arxiv.org/abs/2603.04638

Authors: Prathamesh Pradeep Khole,Mario M. Brenes,Zahra Kais Petiwala,Ehsan Mirafzali,Utkarsh Gupta,Jing-Rebecca Li,Andrada Ianus,Razvan Marinescu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Keywords: assume impermeable boundaries, estimate voxel-level parameters, recovering explicit interfaces, Diffusion MRI, assume impermeable

Comments: 10 Pages, 5 Figures, 2 Tables

Abstract:Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.

103. 【2603.04614】SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D

Link: https://arxiv.org/abs/2603.04614

Authors: Zirui Wang,Ruiping Liu,Yufan Chen,Junwei Zheng,Weijia Fan,Kunyu Peng,Di Wen,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enabling high-level interpretation, remaining intuitively understandable, scene graphs provide, enabling high-level, understandable to humans

Comments:

Abstract:3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.

104. 【2603.04598】PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Link: https://arxiv.org/abs/2603.04598

Authors: Rohan Mahadev,Joyce Yuan,Patrick Poirson,David Xue,Hao-Yu Wu,Dmitry Kislyuk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: false positive avoidance, evaluate false positive, Composed Image Retrieval, single ground-truth answers, made significant progress

Comments: Accepted for CVPR 2026

Abstract:Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods, while achieving mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
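
With several correct answers and explicit hard negatives per query, two numbers become natural to report together: average precision over the top k results and the share of those results that are hard negatives. The sketch below is one common way to compute them and is not PinPoint's benchmarking code; the normalization by `min(k, number of relevant items)`, the function names, and the sample IDs are assumptions.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query with possibly many relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def hard_negative_rate_at_k(ranked_ids, hard_negative_ids, k=10):
    """Fraction of the top-k results that are annotated hard negatives."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    negatives = set(hard_negative_ids)
    return sum(1 for item in top if item in negatives) / len(top)


# Hypothetical retrieval output for a single query.
ranked = ["a", "x", "b", "n1", "c"]
relevant = ["a", "b", "c"]
hard_negatives = ["n1", "n2"]

ap = average_precision_at_k(ranked, relevant, k=5)
hn_rate = hard_negative_rate_at_k(ranked, hard_negatives, k=5)
```

Averaging `ap` over all queries gives mAP@k. In this example the hits at ranks 1, 3 and 5 contribute precisions of 1, 2/3 and 3/5, and one of the five returned items is a hard negative.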

105. 【2603.04568】Mask-aware inference with State-Space Models

Link: https://arxiv.org/abs/2603.04568

Authors: Ignasi Mas,Ramon Morros,Javier-Ruiz Hidalgo,Ivan Huerta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, real-world computer vision, State Space Models, real-world computer, handle inputs

Comments:

Abstract:Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
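
The mask-aware re-normalization that Partial Convolutions use for CNNs can be shown in one dimension. This is a sketch of that CNN-side idea only, not of Partial Vision Mamba; the function name, the omission of a bias term, and the sample values are assumptions made for illustration.

```python
def partial_conv1d(values, mask, weights):
    """Mask-aware 1D convolution (no padding) with re-normalization.

    values and mask have equal length; mask[i] is 1 for valid, 0 for invalid.
    Returns (outputs, updated_mask).
    """
    k = len(weights)
    outputs, new_mask = [], []
    for start in range(len(values) - k + 1):
        window_values = values[start:start + k]
        window_mask = mask[start:start + k]
        valid = sum(window_mask)
        if valid == 0:
            # No valid input under the kernel: emit 0 and keep the hole.
            outputs.append(0.0)
            new_mask.append(0)
            continue
        acc = sum(w * v * m for w, v, m in zip(weights, window_values, window_mask))
        # Rescale by (window size / number of valid entries).
        outputs.append(acc * k / valid)
        new_mask.append(1)
    return outputs, new_mask


values = [2.0, 2.0, 99.0, 2.0, 2.0]  # 99.0 is an invalid reading
mask = [1, 1, 0, 1, 1]
weights = [1 / 3, 1 / 3, 1 / 3]      # mean filter

outputs, new_mask = partial_conv1d(values, mask, weights)
```

Every window contains the invalid sample, yet each output stays close to 2.0 because the masked entry is excluded and the sum is rescaled by the number of valid entries. The updated mask marks a position valid as soon as one valid input falls under the kernel.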

106. 【2603.04565】Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

Link: https://arxiv.org/abs/2603.04565

Authors: Xuan Xu,Prateek Prasanna

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data augmentation, image synthesis plays, tumor microenvironments, plays an important, important role

Comments:

Abstract:Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.

107. 【2603.04562】Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

Link: https://arxiv.org/abs/2603.04562

Authors: Ancymol Thomas,Jaya Sreevalsan-Nair

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Local Climate Zones, Local Climate, Climate Zones, study urban structures, give a zoning

Comments: 25 pages, 12 figures

Abstract:Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at this https URL
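
Weighted decision-level fusion, in its generic form, combines the class probabilities from each modality's classifier with fixed weights and then picks the top class. The sketch below shows that generic form and may differ from the paper's FM4; the weights, the three-class probabilities, and the function name are hypothetical.

```python
def fuse_decisions(prob_sar, prob_msi, weight_sar=0.4, weight_msi=0.6):
    """Weighted average of per-class probabilities from two modality branches."""
    total = weight_sar + weight_msi
    fused = [
        (weight_sar * s + weight_msi * m) / total
        for s, m in zip(prob_sar, prob_msi)
    ]
    # Index of the class with the highest fused probability.
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Hypothetical softmax outputs for one image pair over three classes.
prob_sar = [0.5, 0.3, 0.2]
prob_msi = [0.2, 0.7, 0.1]

fused, label = fuse_decisions(prob_sar, prob_msi)
```

The SAR branch alone would pick class 0, but the more heavily weighted MSI branch moves the fused decision to class 1. Dividing by the weight total keeps the fused vector a valid probability distribution.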

108. 【2603.04538】InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities

Link: https://arxiv.org/abs/2603.04538

Authors: Chengshuai Yang,Xin Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: compressive imaging systems, deployed compressive imaging, assumed forward operator, forward operator deviates, existing benchmark quantifies

Comments: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities

Abstract:State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.

109. 【2603.04509】Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living

Link: https://arxiv.org/abs/2603.04509

Authors: Kooshan Hashemifard,Pau Climent-Pérez,Francisco Florez-Revuelta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: effective Ambient Assisted, Ambient Assisted Living, Ambient Assisted, effective Ambient, Assisted Living

Comments:

Abstract:Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.

110. 【2603.04448】SkillNet: Create, Evaluate, and Connect AI Skills

Link: https://arxiv.org/abs/2603.04448

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: execute complex tasks, flexibly invoke tools, complex tasks, flexibly invoke, invoke tools

Comments: [this http URL](http://skillnet.openkg.cn/)

111. 【2603.04415】The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

Link: https://arxiv.org/abs/2603.04415

Authors: Ruobing Zheng,Tianqi Li,Jianing Li,Qingpei Guo,Yi Yuan,Jingdong Chen

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reasoning-enhanced Large Language, Large Language Models, scenarios remains uncertain, demonstrated remarkable advances, Large Language

Comments: Project Page: [this https URL](https://digital-avatar.github.io/ai/ThinkingBoundary/)

112. 【2603.04405】Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology

Link: https://arxiv.org/abs/2603.04405

Authors: Ekansh Arora

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: AUC, Foundation models, computational pathology, increasingly applied, applied to computational

Comments: 27 pages, 6 figures, 7 tables. Code and data available at [this https URL](https://github.com/ekansh-arora0/cross-species-pathology)

Abstract:Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.
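
A cosine similarity above 0.99 between class prototypes is easy to reproduce when embeddings share one dominant component and differ only slightly along the others. The sketch below uses made-up three-dimensional embeddings, not CPath-CLIP features, and takes a prototype to be the mean embedding of a class, which is an assumption.

```python
import math


def class_prototype(embeddings):
    """Mean embedding of a class (assumed definition of a prototype)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: a large shared first component, with the class
# signal confined to the much smaller second component.
tumor = [[10.0, 0.5, 0.1], [10.2, 0.7, 0.0]]
normal = [[10.1, -0.4, 0.1], [9.9, -0.6, 0.2]]

similarity = cosine_similarity(class_prototype(tumor), class_prototype(normal))
```

The second component has opposite signs for the two classes, yet the similarity still exceeds 0.99 because the shared component dominates both norms. In such a space, nearest-prototype classification has very little margin to work with.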

113. 【2603.05247】ICHOR: A Robust Representation Learning Approach for ASL CBF Maps with Self-Supervised Masked Autoencoders

Link: https://arxiv.org/abs/2603.05247

Authors: Xavier Beltran-Urbano,Yiran Li,Xinglin Zeng,Katie R. Jobson,Manuel Taso,Christopher A. Brown,David A. Wolk,Corey T. McMillan,Ilya M. Nashrallah,Paul A. Yushkevich,Ze Wang,John A. Detre,Sudipto Dolui

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Arterial spin labeling, cerebral blood flow, enabling noninvasive measurements, regional cerebral blood, Arterial spin

Comments:

Abstract:Arterial spin labeling (ASL) perfusion MRI allows direct quantification of regional cerebral blood flow (CBF) without exogenous contrast, enabling noninvasive measurements that can be repeated without constraints imposed by contrast injection. ASL is increasingly acquired in research studies and clinical MRI protocols. Building on successes in structural imaging, recent efforts have implemented deep learning based methods to improve image quality, enable automated quality control, and derive robust quantitative and predictive biomarkers with ASL derived CBF. However, progress has been limited by variable image quality, substantial inter-site, vendor and protocol differences, and limited availability of labeled datasets needed to train models that generalize across cohorts. To address these challenges, we introduce ICHOR, a self supervised pre-training approach for ASL CBF maps that learns transferable representations using 3D masked autoencoders. ICHOR is pretrained via masked image modeling using a Vision Transformer backbone and can be used as a general-purpose encoder for downstream ASL tasks. For pre-training, we curated one of the largest ASL datasets to date, comprising 11,405 ASL CBF scans from 14 studies spanning multiple sites and acquisition protocols. We evaluated the pre-trained ICHOR encoder on three downstream diagnostic classification tasks and one ASL CBF map quality prediction regression task. Across all evaluations, ICHOR outperformed existing neuroimaging self-supervised pre-training methods adapted to ASL. Pre-trained weights and code will be made publicly available.